Elided Branches: 2012

Sunday, December 30, 2012

Make it Easy

One of my overriding principles is: make it easy for people to do the right thing.

This seems like it should be a no-brainer, but it was not always obvious to me. Early in my career I was a bit of a self-appointed build cop. The team I worked on was an adopter of some of the agile/extreme programming principles, and the result of that was a 40+ person team all working against the same code base, which was deployed weekly for 3 distinct business purposes. All development was done against trunk, using feature flags. We managed to do this through the heavy use of automated unit/integration testing; to check code in, you were expected to write tests of course, and to run the entire test suite to successful completion before checking in.

Unsurprisingly, people did this only to a certain level of compliance. It drove me crazy when people broke the build, especially in a way that indicated they had not bothered to run tests before they checked in. So I became the person that would nag them about it, call them out for breaking things, and generally intimidate my way into good behavior. Needless to say, that only worked so well. People were not malicious, but the tests took a LONG time to run (upwards of 4 hours at the worst), and on the older desktops you couldn't even get much work done while the test suite ran. In the 4 hours that someone was running tests another person might have checked in a conflicting change that caused errors; was the first person really supposed to re-merge and run tests for another 4 hours to make sure things were clean? It was an unsustainable situation. All my intimidation and bullying wasn't going to cause perfect compliance.

Even ignoring people breaking the build, this was an issue we needed to tackle. And so we did, taking several months to improve the overall runtime and make things easier. We teased out test suites into specific ones for the distinct business purposes combined with a core test suite. We made it so that developers could run the build on distributed hardware from their local machine. We figured out how to run certain tests in parallel, and moved database-dependent tests into in-memory databases. The test run time went way down, and even better, folks could kick off the tests remotely and continue to work on their machine, so there was much less reason to try and sneak in an untested change. And lo and behold, compliance went way up. All of a sudden my build cop duties were rarely required, and the whole team was more likely to take on that job rather than leaving it to me.

Make it easy goes up and down the stack, far beyond process improvements. I occasionally find myself at odds with folks that see the purity of implementing certain standards and ignore the fact that those standards, taken to extreme, make it harder for people to do the right thing. One example is REST standards. You can use the HTTP verbs to modify the meanings of your endpoints and make them do different things, and from a computer-brain perspective, this is totally reasonable. But this can be very bad when you must add the human brain perspective to the mix. Recently an engineer proposed that we change some endpoints from being called /sysinfo (which would return OK or DEAD depending on whether a service was accepting requests), and /drain (which would switch the /sysinfo endpoint to always return DEAD), into one endpoint. That endpoint would be /sys/drain. When called with GET, it would return OK or DEAD. When called with PUT, it would act as the old drain.

To me, this is a great example of making something hard. I don't see the HTTP verb, I see the name of the endpoint, and I see the potential for human error. If I'm looking for the status-giving endpoint, I would never guess that it would be the one called "drain", and I would certainly not risk trying to call it to find out. Even knowing what it does, I can see myself accidentally calling the endpoint with GET, and now I haven't drained my service before restarting it. Or I accidentally call it with PUT, and now it's been taken out of the load balancer. To a computer brain, GET and PUT are very different, and hard to screw up, but when I'm typing a curl or using Postman to call an endpoint, it's very easy for me as a human to make a mistake. In this case, we're not making it easy for people using the endpoints to do the right thing, we're making it easy for them to be confused, or worse, to act in error. And to what benefit? REST purity? Any quest for purity that ignores human readability does so at its peril.
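
To make the contrast concrete, here is a minimal sketch of the two-endpoint layout, assuming a plain dispatch function rather than any particular web framework. The handle function, the ServiceState class, the PUT-only rule on /drain, the "DRAINING" body and the 405 responses are all my own illustrative assumptions, not the real service:

```python
# Hypothetical sketch: two descriptively named endpoints instead of one
# verb-overloaded /sys/drain. Names, verbs and status codes are assumptions.

class ServiceState:
    """Tracks whether the service has been asked to drain."""

    def __init__(self):
        self.draining = False


def handle(method, path, state):
    """Return (http_status, body) for a request against the sketch service."""
    if path == "/sysinfo":
        if method != "GET":
            return 405, "Method Not Allowed"
        body = "DEAD" if state.draining else "OK"
        return 200, body
    if path == "/drain":
        if method != "PUT":
            # A stray GET on /drain changes nothing and fails loudly.
            return 405, "Method Not Allowed"
        state.draining = True
        return 200, "DRAINING"
    return 404, "Not Found"
```

With this layout, a mistyped verb gets an error instead of silently doing the other thing, and the endpoint name alone tells you which operation you are about to perform.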

All this doesn't mean I want to give everyone safety scissors. I generally prefer to use frameworks that force me and my team to do more implementation work rather than making it trivially easy. I want to make the "easy" path the one that forces folks to understand the implementation to a certain level of depth, and encourages using only the tools necessary for the job. This makes better developers of my whole team, and makes debugging production problems more science than magic, not to mention the advantage it gives you when designing for scale and general future-proofing.

Many great engineers are tripped up by human nature, when there's really no need to be. Look at your critical human-involving processes and think: am I making it easy for people to do the right thing here? Can I make it even easier? It might take more work up front on your part, or even more verbosity in your code, but it's worth it in the long run.

Thursday, December 20, 2012

Building a Global, Highly Available Service Discovery Infrastructure with ZooKeeper

This is the written version of a presentation I made at the ZooKeeper Users Meetup at Strata/Hadoop World in October, 2012 (slides available here). This writeup expects some knowledge of ZooKeeper.

The Problem:
Create a "dynamic discovery" service for a global company. This allows servers to be found by clients until they are shut down, remove their advertisement, or lose their network connectivity, at which point they are automatically de-registered and can no longer be discovered by clients. ZooKeeper ephemeral nodes are used to hold these service advertisements, because they will automatically be removed when the ZooKeeper client that made the node is closed or stops responding.
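
The ephemeral-node behavior can be modeled with a toy in-memory registry. This is only a sketch of the semantics, not the ZooKeeper API: ToyRegistry, its method names, and the example paths are hypothetical, and a real cluster ends sessions on its own when heartbeats stop.

```python
class ToyRegistry:
    """In-memory stand-in for a cluster holding ephemeral advertisements."""

    def __init__(self):
        self._nodes = {}  # advertisement path -> id of the owning session

    def advertise(self, session_id, path):
        """Create an advertisement owned by the given session."""
        self._nodes[path] = session_id

    def withdraw(self, path):
        """Explicitly remove an advertisement."""
        self._nodes.pop(path, None)

    def end_session(self, session_id):
        """Closing or expiring a session deletes every node it created."""
        self._nodes = {p: s for p, s in self._nodes.items() if s != session_id}

    def discover(self, prefix):
        """Return the advertised paths under a service prefix, sorted."""
        return sorted(p for p in self._nodes if p.startswith(prefix))
```

For example, after two sessions advertise under /services/pricing/ and one of them ends, discover("/services/pricing/") returns only the surviving server.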

This service should be available globally, with expected "service advertisers" (servers advertising their availability, aka, writers) able to scale to the thousands, and "service clients" (servers looking for available services, aka, readers) able to scale to the tens of thousands. Both readers and writers may exist in any of three global regions: New York, London, or Asia. Each region has two datacenters with a fat pipe between them, and each region is connected to each other region, but these connections are much slower and less tolerant of piping large quantities of data.

This service should be able to withstand the loss of any one entire data center.

As creators of the infrastructure, we control the client that connects to this service. While this client wraps the ZooKeeper client, it does not have to support all of the ZooKeeper functionality.

Implications and Discoveries:
ZooKeeper requires a majority (n/2 + 1, rounding n/2 down) of servers to be available and able to communicate with each other in order to form a quorum, and thus you cannot split a quorum across two data centers and guarantee that the quorum will be available with the loss of any one data center (because losing the data center that holds at least half of the servers leaves the survivors short of a majority). To sustain the loss of a data center, therefore, you must split your cluster across 3 data centers.
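
The majority arithmetic is easy to check with a few lines of code. This is a small sketch of the counting argument only; the function names are mine and nothing here talks to ZooKeeper.

```python
def quorum_size(n):
    """Smallest strict majority of an n-server ensemble: floor(n / 2) + 1."""
    return n // 2 + 1


def survives_any_single_loss(servers_per_dc):
    """True if a quorum remains after losing any one data center.

    servers_per_dc lists how many ensemble members sit in each data center.
    """
    total = sum(servers_per_dc)
    needed = quorum_size(total)
    return all(total - lost >= needed for lost in servers_per_dc)
```

A 3+2 split across two data centers fails this check because losing the data center with 3 servers leaves 2, and 5 servers need 3 for a quorum. A 2+2+1 split across three data centers passes.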

Write speed dramatically decreases when the quorum must wait for votes to travel over the WAN. We also want to limit the number of heartbeats that must travel across the WAN. This means that a ZooKeeper cluster with nodes spread across the globe is undesirable (due to write speed), and a ZooKeeper cluster with members only in one region is also undesirable (because writing clients outside of that region would have to continue to heartbeat over the WAN). Even if we decided to have a cluster in only one region, we would have to solve the problem that no region has more than 2 data centers, and we need 3 data centers to handle the loss/network partition of an entire data center.

Solution:
Create 3 regional clusters to support discovery for each region. Each cluster has N-1 nodes split across the 2 local data centers, with the final node in the nearest remote data center.

By splitting the nodes this way, we guarantee that there is always availability if any one data center is lost or partitioned from the rest of the data centers. We also minimize the effects of the WAN on write speed by ensuring that the remote quorum member is never made into the leader node, and the general effect of the majority of nodes being local means that voting can complete (thus allowing writes to finish) without waiting for the vote from the WAN node in normal operating conditions.
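
A simplified model shows why the local majority keeps writes fast. The sketch below assumes a write commits once a majority of the ensemble has acknowledged it, and it uses made-up latencies (1 ms locally, 80 ms over the WAN) purely for illustration; it is not a measurement and not how ZooKeeper is implemented internally.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the configured ensemble."""
    return ensemble_size // 2 + 1


def commit_latency_ms(ack_latencies_ms, ensemble_size):
    """Latency until a majority of the ensemble has acknowledged a write.

    ack_latencies_ms lists only the reachable voters, leader included (0 ms).
    Returns None when too few voters are reachable to form a quorum.
    """
    needed = quorum_size(ensemble_size)
    if len(ack_latencies_ms) < needed:
        return None
    # The write completes when the needed-th fastest acknowledgement arrives.
    return sorted(ack_latencies_ms)[needed - 1]
```

With a 5-node cluster (leader plus three local followers and one WAN node), commit_latency_ms([0, 1, 1, 1, 80], 5) is 1: the three fastest votes are all local. If one local data center is lost and only [0, 1, 80] remain, the result is 80, because the WAN vote is now required.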

3 Separate Global Clusters, One Global Service:
Having 3 separate global clusters works well for infrastructural reasons mentioned above, but it has the potential to be a headache for the users of the service. They want to be able to easily advertise their availability, and to discover available servers, preferably finding those in their local region first and falling back to remote regions only if no local servers are available.

To do this, we wrapped our ZooKeeper client in such a way as to support the following paradigm:
Advertise Locally
Lookup Globally

Operations requiring a continuous connection to the ZooKeeper, such as advertise (which writes an ephemeral node) or watch, are only allowed on the local discovery cluster. Using a virtual IP address we automatically route connections to the discovery service address of the local ZooKeeper cluster and write our ephemeral node advertisement here.

Lookups do not require a continuous connection to the ZooKeeper, and so we can support global lookups. Using the same virtual IP address we can connect to the local cluster to find local servers, and failing that use a deterministic fallback to remote ZooKeeper clusters to discover remote servers. The wrapped ZooKeeper client will automatically close its connection to the remote clusters after a period of client inactivity, so as to limit WAN heartbeat activity.
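
The "lookup globally" half of the paradigm can be sketched as a local-first search with a fixed fallback order. Plain dictionaries stand in for the three regional clusters here, and the region keys, service names, server names and fallback order are all hypothetical; the real wrapped client, its connection handling and its idle timeout are not shown.

```python
def lookup(service, local_region, clusters, fallback_order):
    """Return (region, servers) from the first region advertising the service.

    clusters maps region -> {service name -> list of advertised servers}.
    fallback_order maps a region to the remote regions to try, in order.
    """
    regions = [local_region] + list(fallback_order.get(local_region, []))
    for region in regions:
        servers = clusters.get(region, {}).get(service, [])
        if servers:
            return region, sorted(servers)
    return None, []
```

Because the fallback order is fixed per region, every client in New York that finds no local server tries the same remote region next, which keeps behavior predictable when debugging.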

Lessons learned:
ZooKeeper as a Service (a shared ZooKeeper cluster maintained by a centralized infrastructure team to support many different clients) is a risky proposition. It is easy for a misbehaving client to take down an entire cluster by flooding it with requests or making too many connections, and without a working hard quota enforcement system, clients can easily push too much data into ZooKeeper. Since ZooKeeper keeps all of its nodes in memory, a client writing huge numbers of nodes with a lot of data in each can cause ZooKeeper to garbage collect or run out of memory, bringing down the entire cluster.

ZooKeeper has a few hard limits. Memory is a well-known limit, but another limit is the number of sockets for a server process (configured via the ulimit in *nix). If a node runs out of sockets due to too many client connections, it will basically cease to function without necessarily crashing. This is not surprising for anyone that has experienced this problem in other Java servers, but it is worth noting when scaling your cluster.
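
As a rough sanity check before scaling up client counts, you can compare the expected number of connections against the open-file limit, since each socket consumes a file descriptor on *nix. The sketch below reads the limit for the Python process that runs it (not for a ZooKeeper server), works only on Unix-like systems, and the reserve of 256 descriptors for logs and other files is an arbitrary assumption.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open file descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_socket_headroom(expected_connections, soft_limit, reserved=256):
    """True if expected client sockets plus a reserve fit under soft_limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return expected_connections + reserved <= soft_limit
```

With a soft limit of 1024, has_socket_headroom(1000, 1024) is False: 1000 connections plus the reserve would exceed the limit, so the limit needs raising before that many clients connect to one node.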

Folks using ZooKeeper to do this sort of dynamic discovery platform should note that if the services you are advertising are Java services, a long full GC pause can cause their session to the ZooKeeper cluster to time out and thus their advertisement will be deleted. This is probably a good thing, because a server that is doing a long-running full GC won't respond to client requests to connect, but it can be surprising if you are not expecting it.

Finally, I often get the question of how to set the heartbeats, timeouts, etc, to optimize a ZooKeeper cluster, and the answer is really that it depends on your network. I really recommend playing with Patrick Hunt's zk-smoketest in your data centers to figure out sensible limits for your cluster.

Sunday, November 18, 2012

On Fit and Emotional Problem Solving

One of the biggest challenges Rent the Runway has is the challenge of getting women comfortable with the idea of renting. That means a lot of things. There are questions of timing, questions of quality. But the biggest question by far is the question of fit. Our business model, if you are unfamiliar, is that you order a dress typically for a 4 day rental period, which means that the dress comes very close to the date of your event, possibly even the day of that event. If it does not fit, or you don't like the way it looks on you, you may not have time to get something else for the occasion. As a woman, this uncertainty can be terrifying. Getting an unfamiliar item of clothing, even in 2 sizes, right before an event important enough to merit wearing something fancy and new is enough to rattle the nerves of even the least fussy women out there. This keeps many women from trying us at all, and presents a major business obstacle.

Given this obstacle, how would you proceed? When I describe my job to fellow (usually male) engineers, and give them this problem in particular, their first instinct is always to jump to a "fit algorithm". I've heard many different takes on how to do 3D modeling, take measurements, use computer vision techniques on photographs in order to perfect an algorithm that will tell you what fits and what doesn't.

Sites have been trying to create "fit algorithms" and virtual fit models for years now, and none has really gained much traction. Check this blog post from 2011, about that year being the year of the "Virtual Fit Assistant". Have you heard of these companies? Maybe, but have you or anyone you know actually USED them?

I would guess that the answer is no. I know that for myself, I find the virtual fit model incredibly off-putting. I trust the fit even less seeing it stretched over that smooth polygon sim that is supposed to be like me. Where are the lumps going to be? Is it really going to fit across my broad shoulders? The current state of 3D technology looks ugly and fake and I'm more likely to gamble on ordering something from a site with nothing but a few measurements or a model picture than one where I can make this fake demo. The demo doesn't sell me, and worse, it undermines my fit confidence, because it doesn't look enough like me or any real person and it makes me wonder how those failures in capturing detail will translate into failures in recommending fit.

I've come to realize in my time at this job that what engineers often forget when faced with a problem is the emotional element of that problem. Fit seems like an algorithmic problem, but for many women, there is a huge emotional component to trying things on. The feel of the fabric. The thrill of something that fits perfectly. The considerations and adjustments for things that don't. Turning fit into a cheesy 3D model strips all emotion from the experience, and puts it into the uncanny valley of not-quite-realness. I do think that someday technology will be able to get through the valley and provide beautiful, aspirational 3D models with which to try on clothes, but we aren't there yet. So what can we do?

At Rent the Runway, we've discovered through data that when you can't try something on, photos of real women in a dress are the next best thing. Don't forget that the human brain is still much more powerful than computers at visual tasks, and it is much easier for us to imagine ourselves in an item of clothing when we see it on many other women. This also triggers the emotional response much more than a computer-generated image. Real women rent our dresses for major, fun, events. They are usually smiling, posing with friends or significant others, looking happy and radiant, and that emotion rubs off on the viewer. It's not the same as trying something on in a dressing room, but it is like seeing a dress on your girlfriend and predicting that the same thing would look fabulous on you.

This insight led us to launch a major new subsite for Rent the Runway called Our Runway. This is a view of our inventory that allows women to shop by photos of other women wearing our dresses. It is driven by data but the selling point is emotional interaction. Learning to use emotional reasoning was a revelation to me, and it might be the most valuable engineering insight I've picked up in the last year.

Sunday, October 14, 2012

Get Better Faster

I heard a very interesting piece of advice this week from my CEO, addressing a group of college students that were visiting our office. Her words went something like this:

"Most days you have 100 things on your to-do list. Most people, when faced with such a list, will find the 97 that are obvious and easy and knock them off before worrying about the 3 big hairy things on the list. Don't do that. The 97 aren't worth your time, it's the 3 big hairy things that matter."

I've been thinking about that bit of wisdom ever since I heard it. It seems counter-intuitive in a way. Anyone that has ever suffered from procrastination knows that sometimes you feel better, more able to tackle problems when you break them down into a todo list. You get little things done and make yourself feel accomplished. But the more I think about the advice from my CEO, the more I agree with it. Especially in an entrepreneurial setting, or in a setting where you are suddenly given far more responsibilities than you are used to having. Why? It all boils down to three little words: Get Better Faster.

Get better faster. That's what I've spent my last year trying to do. I went to a startup to grow, to stretch in ways that I couldn't stretch in the confines of a big company. And when I suddenly found myself running the whole engineering team, this learning doubled its speed overnight. Being an ok manager and a great engineer is no longer enough for me to do my job. I need to be an excellent manager, an inspirational leader, a great strategist and a savvy planner. And the engineer can't totally slack off, but she needs to be saved for the really nasty bugs, not implementing fun new features.

This has all taught me a difficult lesson: you get better faster by tackling the hardest problems you have and ignoring the rest. Delegate the things that are easy for you (read: the little things you do to feel good about your own productivity) to someone who still needs to learn those skills. Immerse yourself in your stretch areas. For me, this mostly means that I have to delegate coding and design details to the engineers working for me, and I have to delegate the ownership of my beloved projects and systems to someone with the time to care for them. This is PAINFUL. I would call the last 3 months, spent mostly out of coding and in planning/management/recruiting land, some of the hardest of my career.

And yet, I'm doing it. I'm getting better. And it's not just me who is getting better. It's every member of my team that has had to step up, to fill in the empty positions of leadership, to take over the work I can't do, or the work that the person who took work from me can't do.

You'll never get better doing the easy stuff, checking off the small tasks. A savvy entrepreneur knows that the easy stuff can always be done by someone else, so let someone else do it. The hard problems are the problems that matter.

Sunday, September 9, 2012

Becoming the Boss

One of the reasons people go to work for startups is that sense that anything could change at any time. You could go big, you could go bust, you could pivot into a completely new area. About a month ago, I got my first taste of this when, following my boss's departure from the company, I found myself in the role of head of engineering. And what a change this has been.

Call it the Dunning-Kruger effect, or simply call it arrogance, but I think if you had asked me before I was put into this position whether I could do the job well, I would have told you certainly yes. Did I want the job? Not really. But I could totally do it if I had to. Sure, I've never had full responsibility for such a large organization before, but I'm a decent manager, I have leadership skills, and I know my technical shit. That should be enough.

Here is what I have learned in the last month. Leading 6 people in successful completion of their tasks, providing technical guidance, and handling the occasional interpersonal issue is nothing like being responsible for 20 people delivering quality releases, keeping their morale up, knowing when things are going wrong in the technical, interpersonal or career sense, and having to additionally report everything to your CEO and heads of business. When there is no buffer in your department above you that people can go to when your guidance is lacking, the weight of that responsibility is 10 times what you ever expected. A sudden transition of leadership even in a solid organization such as ours stirs up long-simmering conflicts. I'm down one pair of ears to listen and mediate.

And then there's recruiting. Helping with recruiting, giving good interviews, and saying good things about the company is nothing like owning the sell process from the moment of first technical contact with the candidate over the phone or at coffee, through onsite interviews, and into a selling stage. Good candidates need to be coaxed, guided, and encouraged often several times before they even get in the door. And one bad interview with you, the head of the department, can sully the name of the whole organization to a person even if you didn't want to hire them. I know this, but that didn't stop me from conducting a terrible phone screen a few days ago where my stress and impatience showed through as rudeness to the candidate. I thought I knew how to recruit but one bad interview and I'm in my CEO's office for some clearly-needed coaching.

I have known for a long time that even in the lesser leadership roles I've held in the past, the things I say and do echo much larger than I expect them to. But that was nothing compared to the echoes from being the person in charge. My stress causes ripples of stress throughout the staff. When I speak harshly to people over technical matters, it is yelling even if I don't intend it to be. One snide comment about a decision or a design invites others to sneer at that decision along with me.

The best advice I've gotten in the past month has been from my mother, who told me simply to smile more. My echo can be turned into echoes of ease and pride and even silliness and fun if I remember to look at the positives as much as the negatives. When I remember to smile, even if I'm unhappy about a decision, I find myself able to discuss that decision without inviting judgement upon the person that made it. When I smile through a phone call with a potential recruit, I sell the company better. When I smile through my 1-1s people feel that they can raise concerns without worrying that I will yell at them. When I smile, I see people step up and they take on bigger responsibilities than they've ever had, and knock them out of the park over and over again, which makes me smile even more. A smile is the thing that keeps me tackling this steep learning curve of leadership. So I try to smile, and every week I learn more than I've learned in a month at this job or a year at my previous company. Because change is scary and hard, but in the long run, it's good.

Monday, August 20, 2012

The Science of Development

Despite the study of computing being termed "Computer Science", most working software developers will agree that development is more art than science. In most application code, the hard decisions are not which algorithm to use or which data structure makes the most sense. Rather, they tend to focus on how to build out a big system in a way that will let lots of people contribute to it, how to build a system such that when you go back to it you can understand what's going on, how to make things that will last longer than 6 months. While occasionally those decisions have a "right" answer, more often they are the type of judgement calls that fall into the art of computer programming.

There are, however, two very important areas in which science, and by science I mean careful observation and logical deduction, plays a very important role. These two areas are debugging and performance analysis. And where developers are often good at the art of programming, many of us fall short on the science.

I've been meaning to write a blog post about what makes a person a truly great debugger. This will not be that post, but in thinking about the topic I polled some of the people I believe are great debuggers including my longtime mentor. His response was interesting and boils down the concepts very succinctly. Good debuggers have a scientific bent. Some people, when faced with a bug, will attempt to fix it via trial and error. If changing the order of these statements makes the threading problem go away, the bug must be solved by that change. Good debuggers, on the other hand, know that there is a root cause for errors that can be tracked down, and they are able to break down the problem in a methodical way, checking assumptions and observing behavior through repeated experiments and analysis.

It's a wise point, and one that I must agree with. If I observe myself debugging hard problems, a few things come out. One, I always believe that with enough patience and help I can in fact find the root of the problem, and I pretty much always do. Two, when I dig into a tough problem, I start keeping notes that resemble a slightly disorganized lab notebook. In particular when debugging concurrency problems between distributed systems, to get anywhere sensible you have to observe the behavior of the systems in correct state, and in problem state, and use the side-effects of those correct and incorrect states (usually log files) to build hypotheses about the root cause of the problem.

I've also been thinking lately about this scientific requirement in performance analysis and tuning. We're going through a major performance improvement exercise right now, and one of the things that has caused us challenges is a lack of reproducible results. For example, one of the first changes we made was to improve our CSS by converting it to SCSS. We were relying on New Relic, our site metrics monitor, to give us an idea of how well we improved our performance. This caused us a few problems. One, our staging environment is so random due to network and load that performance changes rarely show up there. Two, New Relic is not a great tool for evaluating certain kinds of client-side changes. Lacking any sort of reliable evaluation of performance changes in dev/staging caused us to make a guess (SCSS would be better) that took a week plus to pan out and resulted in inconclusive measurements. We were looking for luck, but we needed science, and our lack of scientific approach hampered our progress. Now we have taken a step back and put reproducible measures in place, such as timing a set of well-known smoke tests that run very regularly, and using GTMetrix to get a true sense of client-side timings. Our second performance change, minified JavaScript, has solid numbers behind it and we finally feel certain we're moving in the right direction.
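
The idea of a reproducible measure can be sketched in a few lines: run the same task several times, summarize with the median so a single noisy run doesn't dominate, and only call a change an improvement if it clears a threshold. This is an illustration of the approach rather than our actual tooling; the function names and the 5% threshold are assumptions.

```python
import statistics
import time


def time_runs(task, runs=5):
    """Run task() `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Median, fastest and slowest run for a list of durations."""
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


def is_improvement(before, after, min_gain=0.05):
    """True if the median of `after` beats the median of `before` by min_gain."""
    baseline = statistics.median(before)
    return statistics.median(after) <= baseline * (1 - min_gain)
```

In practice the task would be a smoke test hitting a known page, timed before and after a change in the same environment, with the raw durations kept alongside the summary so the result can be rechecked later.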

Is there art in debugging and performance tuning? Well, there is the art of instinct that lets you cut quickly through the noise to the heart of a problem. Well-directed trial and error (and well-placed print statements) can do a lot for both debugging and performance analysis. But getting these instincts takes time, and good instincts start from science and cold, hard, repeatable, observable truths.

Sunday, August 12, 2012

Being Right

One of the stereotypes of the computer industry is the person that knows they're right, and doesn't understand when it doesn't matter. I have been this person. I once got into a knock-down-drag-out fight with another engineer over the decision to make an API I had written accept case-sensitive data. He absolutely insisted that it must, even though pretty much any case that would require case-sensitivity would almost certainly mean abusing the system design for nefarious purposes. We argued over email. We argued over the phone. I would be damned if I would give in.

At some point my boss stepped in and told me I had to change it, because this person was one of the most important clients of the API and he didn't want to have strife between our teams. I was furious. I yelled at him. I told him we shouldn't negotiate with terrorists. It didn't make a difference, his decision was final. I made the change, and the other developer proceeded to abuse the system in exactly the ways I had predicted. But he also used it, which was all that mattered to my boss, and in retrospect, was more important for the success of the project than my sense of rightness.

As a manager, I find myself on the other side of the fray, as I negotiate with one of my own developers over doing something "right". The time it takes to do something "right" simply isn't worth the fight I would have to have with product, or analytics, or other members of the tech team. It's not worth the time we would spend debating correctness. (At this point, it's usually ALREADY not worth the time we've spent arguing with each other). It's not worth the testing overhead. It's not going to move the needle on the business.

No matter how much I say "you're right, but it doesn't matter," it doesn't sink in. They don't believe me that I know that they're right, that I agree with their technical analysis but that it's not enough to change my mind. They tell me that they understand that technical decisions are sometimes a series of non-technical tradeoffs, but in this case, why can't I see that this is just the right way to do it?

Here's what I wish my boss had spelled out to me, and what I hope I can explain to my own developers when we run into such conflicts in the future: I know where you are coming from. I have nothing but immense sympathy for the frustration you feel. I know that it seems like a trivial thing, why does that other team feel the need to insist that it would take them too much time to integrate, that they don't want to test it, why does that idiot say that it MUST be case-sensitive? Someday, you will be in my shoes. You will be worn down from fighting for right, and you will be driven to produce the best results with the most consensus and not distract everyone with the overhead of debating every decision. And you'll probably smile and think, well, that dude abused my system but he also drove adoption across the company, and I still resent my boss for not fighting for me, but I understand where she was coming from.

Wednesday, August 1, 2012

Growing New Leaders: A Modest Proposal

I've often thought that the startup industry has a leadership problem. It has a leadership problem thanks to its virtues: small companies, agile processes, minimum viable products. The dirty, tiring work of people leadership is often at odds with the hustle of startup life. And yet we as an industry need to make space for that work, because we're starving for leaders and hurting our companies due to this lack of leadership.

Part of the problem is the myth of the natural leader. We'll promote from within, just take the smartest engineer that is able to communicate adequately and make them team lead. But there is very little that is natural about team management. Management is more than just being articulate in meetings and giving a good presentation now and again. It's learning how to communicate with stakeholders in their language. It's figuring out how to resolve conflicts between personalities on your team, even when one of the personalities is your own. It's the detail work of project planning for two people over the next month and twenty people over the next year. Heck, people go to business school and spend a good chunk of time studying these issues. We often disparage the MBA, but we do very little to replicate that skill channel in our own industry.

How does one end up with these abilities? Well, some read books and try to muddle their way through, with varying degrees of success. An ambitious person might reach out to external parties for help, people they know have been there and can provide advice. But both of these channels pale in comparison to learning leadership in an apprenticeship model. I know because I myself have tried books and advisers, but in the end I've learned most of my leadership skills thanks to hours of one-on-one time with current and former bosses, walking through presentations and plans, discussing personnel issues, perfecting pitches. I'm still learning and growing these skills, having moved from mentoring one or two people, to managing a small team, to working as a director of engineering and technical architect for a growing company. And now it is time for me to grow my own leaders. I've written before about the importance of training your replacement, and I wish that as an industry we would all take this step seriously.

Here's my proposal. Let's stop promoting people to the role of senior engineer who have never had the experience of hands-on direction of people and projects. If someone has five to ten years of experience writing software in teams, and they have never had the responsibility of mentoring junior developers in a meaningful way, they are lacking something in their skill set. Let's give them the tools they need to be successful as leaders, taking time from the hustle to teach planning, managing, communicating across teams. This experience might just be a short stopover from individual contributor position to more senior individual contributor position, but it's a necessary part of that career path. If we could get a standard across our industry that everyone that was called senior developer had this experience in mentoring and management, we would have something to work on. It would be very unlikely for a company of 20 people to have no one with prior leadership experience. We could successfully grow our leaders from within knowing that we have a small core of people that has already learned the fundamentals of management.

This training isn't free. It will cost us time that we might think should be spent towards the hustle of delivering. But we know that software is humans, and we know that the payoff for delivering is bigger teams, bigger companies, and more responsibilities to manage. If the title of senior engineer recognizes that you can deliver, it should also recognize that you can handle what happens when you do.

Sunday, July 22, 2012

The Siren Songs of Hack Day Projects

I love hack days. I think spending a day every now and then to just work on whatever inspires you is one of the best trends of the last five years. Whether you use it to try out a new idea, fix a problem that's been bothering you for months, or play with a new tool, it's a great use of engineering time.

However, we can't forget the seductive illusion of the hack. When you have one day to make something, and you decide to make something new, you make it however you can. You don't write tests (unless perhaps you are the most enlightened TDDer). You don't worry about edge cases. You don't plan for scale. You're playing! You're having fun! You're creating something beautiful and impressive that can bring new frontiers to your business!

What's the downside? A good hack day produces ideas and projects that you want to put into production. But when the time comes to actually take those projects and make them production-worthy it's often quite a letdown. At Rent the Runway we've seen this happen twice over our two hack days. On the first, the whole team worked on a project to rebuild "Rent the Runway" on a completely new platform. The story goes that they got 80% of the way through in one 24-hour binge. 80%! This incredible feat actually led us to say that we would move the entire site off of our existing platform within the next 3-4 months. Heck, if we could get 80% of the site rewritten by the whole team working for a day, surely we could get the site rewritten completely in a few months.

I'm a bit embarrassed to admit that we all bought into the hype of the 80% completion for longer than we should've. In reality, the 80% completion was more like "80% of basic functionality completed", which is still very impressive but ignores the 50% of required functionality that goes well beyond basic. A 3-4 month rewrite timeline might be realistic if you could take the entire engineering team, build no new features, and eventually ship something similar but not quite the same. Of course, none of these things are ever true.

The most recent hack day had some of our engineers build a very cool project utilizing our reviews data. So cool, in fact, that our product team latched on to it and decided to make it into a full-fledged major launch. Success! And now... we're just finishing a couple of intense weeks of project planning for what is quickly turning into a man-year's worth of an engineering project. It will be cool when it launches, but I don't think any of us expected this cool thing that took a day to hack to turn into 4 engineers working for three full months just to get a polished v1.

The hack day siren song can also cause you to fall in love with your hack so much you don't see the cost to making it production-ready. I've known of engineers that went off on their own and neglected their other duties for projects that would never see the light of production due to their complexity and questionable business value. It's hard to complain when people are working on something they're passionate about, but it's important for a team to feel like they are all working toward the same goal, and that can be difficult when some members are spending most of their time tinkering with side projects.

I'm jaded and boring, so my most recent hack day project took some images that were stored in the database and put them on the CDN. It took me longer than anticipated, but I wrote tests and made it work in both production and staging and released it that day. (My expected 2 hour hack still took me closer to 6 hours end to end.) It also enabled other hack day projects, which was really the point. At the end of the day, I'm glad most of the engineers I work with use their hack days to shoot for the moon, even if the final results always take longer to achieve than we hope and maybe spill over past that one day. After all, what's the point in fast-loading images if no one ever views them?

Thursday, July 12, 2012

On Yaks and Hacks

A Yak Story

Today I released a nice little feature into production that tests promoting a customer's "hearted" items to the top of their search results. I decided that it would be a good test after seeing some analytics data on users of this particular feature, and sold it to the head of analytics and myself as a quick feature to spin up, half a day's dev work for me. We already had the data, I just needed to play with the ordering of the products to display.

Of course, it didn't turn out to be quite that simple. When I opened the code and prepared to add the new feature, I discovered a few things. First, my clean checkout of the code base wouldn't even compile. Apparently in the time I had been away from it, the other devs had renamed a maven dependency, rtr_infra_common, to rtr_services_common, only to discover that they needed both. Instead of starting the new rtr_infra_common from, say, version 2, they started it back at version 1.0. My local maven was very, very confused.

Having figured that out, I finally had a clean, compiling workspace, so time to code. I have all the data I need to do this right here, right? Except, this "score" field that claims to be in the DTO for our products is not in fact being serialized by the providing service. Oh and in my absence the package that holds that DTO has also gotten refactored into its own project. And rtr_services_common depends on that. And rtr_infra_common depends on THAT. And I depend on rtr_infra_common. So just to get the new data, I have to update the product catalog service, the DTO, and the whole chain of dependent projects.

As an aside, I think maven is kind of nice, but dependency management systems have caused the biggest yaks I've ever shaved, and those couple of hours of annoyance are a small drop in the bucket.

So, got my data. But now I find another problem. The framework I'm working in was designed to take all of this data and apply experiments to it before returning it to the front end. But we hadn't actually built out a notion of the "user" beyond a userId for logging and bucketing purposes. There was nowhere to put the personalized data I needed to use to enhance the results. I could just save the list of items I needed and add an additional argument to the experiment processor with that data, but I knew that we would want more and different signals in the near future. So I took another couple of hours, refactored all of the code to properly model a user, changed and enhanced all the tests, and finally, finished the job... Except it didn't work. I had forgotten to update the clone method in our DTO to copy the new data I had added. My colleague stepped in for that last bit of belly shaving, as I had now blown my estimate by a good day and my other work was piling up. After going through the irritation of changing 4 projects for 1 line of code, he understood my earlier cursing, and we had our feature finally ready.
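The forgotten clone update is the kind of bug the compiler can catch for you. Here is a minimal sketch, assuming a made-up DTO (the class name SearchResultDto and its fields are hypothetical, not our real code): if the copy method goes through the one full constructor, then adding a field changes the constructor signature and the copy method stops compiling until it is updated too.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical DTO for illustration only; names are not from the real code base.
class SearchResultDto {
    private final List<String> productIds;
    private final List<String> heartedIds; // stands in for the field added later

    // The single full constructor: every field must be passed in.
    SearchResultDto(List<String> productIds, List<String> heartedIds) {
        this.productIds = new ArrayList<>(productIds);
        this.heartedIds = new ArrayList<>(heartedIds);
    }

    List<String> getProductIds() { return productIds; }

    List<String> getHeartedIds() { return heartedIds; }

    // Copying through the full constructor means a newly added field
    // cannot be silently skipped: the call below would no longer compile.
    SearchResultDto copy() {
        return new SearchResultDto(productIds, heartedIds);
    }

    public static void main(String[] args) {
        SearchResultDto original = new SearchResultDto(
                Arrays.asList("dress-1", "dress-2"),
                Arrays.asList("dress-2"));
        SearchResultDto copied = original.copy();
        // The copy carries the newer field as well as the old one.
        System.out.println(copied.getHeartedIds().equals(original.getHeartedIds()));
    }
}
```

A hand-written clone that sets fields one by one has no such safety net, which is how one missing line ended up costing another round of changes.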

A Hack Story

On Wednesday, I get a report that there's a snag in a release that is supposed to go out this week. We're taking some work that used to be done manually by our email marketing staff and automating it. The developer on the project has hit a bug. Apparently, our email service provider does not support UTF-8 encoding, and of course, being a fashion company, we have designer names with the occasional accent that we need to throw around. He's been struggling with this, trying to figure out why escaping isn't working, and finally discovering that only Latin-1 will suffice. Before he sits down to write this re-encoding of everything, we ask our email marketers how they did this in the past. "Oh, we just change them to plain text characters". How many characters are we talking about? "Just an é and a ë". Would it be easier to just convert them to plain text than to Latin-1? Yes. And so we did.
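For the curious, the hack is about as small as it sounds. This is a sketch rather than our actual code: the class and method names are hypothetical and the sample name is invented. The first method is the narrow replacement of the two characters mentioned above. The second is a broader alternative that we did not use, built on the standard java.text.Normalizer, which decomposes accented letters and drops the combining marks.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not the code that shipped.
class PlainTextNames {

    // The narrow hack: swap only the two accented characters in question.
    // Unicode escapes keep this source file pure ASCII.
    static String replaceKnownAccents(String s) {
        return s.replace('\u00e9', 'e')   // e with acute accent
                .replace('\u00eb', 'e');  // e with diaeresis
    }

    // Broader alternative (an assumption, not what the post describes):
    // decompose to NFD, then strip all combining marks.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String name = "Zo\u00eb Andr\u00e9"; // invented sample name
        System.out.println(replaceKnownAccents(name)); // Zoe Andre
        System.out.println(stripDiacritics(name));     // Zoe Andre
    }
}
```

The narrow version leaves any other accented character untouched, which was acceptable for us only because the marketers knew exactly which two characters showed up.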

Yaks vs Hacks

I don't like hacks, but they can be expedient. Instead of spending time going back and forth with our ESP we will get this manual email automated and out the door, and worry about our text encoding problems later. Shaving this yak buys us nothing at this point.

On the other hand, I thought my new feature would be a quick hack that could give us a bunch of interesting data, and it turned into a minor yak (or perhaps just a shaggy water buffalo) that took a bit too long but was not quite a full-blown project. I couldn't hack the dependency problems, and hacking the code restructuring would have resulted in threading an arbitrary parameter through a bunch of unrelated code just to avoid thinking about the right solution. The upside of the yak shaving was more data, a better code base for making such changes in the future, and a recognition of a serious dependency structure problem that we need to address. So today I celebrate Yak Shaving Day. I hope it only comes once a year!

Thursday, July 5, 2012

Moneyball

The first thing they always do is quiz you. When big tech companies look for employee prospects, the first test is always basic knowledge. The Googles of the world have a checklist. "Coverage Areas" is what they call the talents they're checking for in a candidate. The big three are Coding, Algorithms and Analysis, Systems Design. Maybe they will also care about Communication skills, and Cultural Fit. They put you through the paces. Can she code? Does she know her Big-O analysis, her data structures? Can she design a distributed system to mimic Twitter? Tick, tick, tick. Pass enough of the checkboxes, and they'll make an offer. Nice salary, great benefits, the chance to work on big problems with these intimidatingly smart folks. When it comes to the deep pockets and well-oiled recruiting machine of a big tech company, it can be hard for a startup to compete for talent.

I was thinking about this a lot as I watched Moneyball last night, and I found myself wishing there were a sabermetrics for developers. After all, what is a startup but a cash-strapped team trying to win against the Yankees? This is an imperfect analogy, but let me break it down.

First, a step back for those of you that haven't watched the movie and don't give a flip about baseball. Sabermetrics is basically a system that tries to predict, through past performance, how well a player will do in the future on certain metrics. In contrast to tests of basic skills (running, throwing, fielding, hitting, power) that talent scouts look for, sabermetrics focuses on the statistics of a player's actual game performance. Instead of going broke on a couple of big-name players, as many teams would, or gambling on hyped-up fresh "potential", the Oakland A's used this system to draft cheap players that together ended up forming a competitive team against the A-list roster of the New York Yankees.

The thing about baseball that makes it a useful analogy for many startups is that baseball is a team sport where the size of the team is fixed and the potential of the team in aggregate matters. The same is true for a tech team. We see so many stories about how you should only hire "A" players, that they attract other A players, and they out-perform other developers 10-1 (or 50-1 or 100-1 depending on how grandiose you're feeling). But how do you find these A players? The Derek Jeters are probably settled, and cost more than you can afford. Google and its kin have deep enough pockets to hire anyone that ticks off all the "Coverage Areas", knowing that they will get lucky and find some natural As in that bunch, train up some more, and the rest will be good enough to take on the mountains of other work needed at such a huge company.

The thing that Google intentionally neglects, and the thing you absolutely can't afford to ignore, is past performance in the game. What is the tech equivalent of on-base percentage? Product shipped. This is why everyone drools over open source developers as hiring potential. You can see their work, you can know that they've shipped. Teasing this out of others is a tough process. We don't have visibility into the work that most people have done. Writing code for you proves that they have the tools to deliver, and is an essential part of the interview process, but it's not enough to show that they actually have delivered. All we can do is ask them, as a detailed part of their interview, what they have done.

I realize that relying on reports of past experience as part of the hiring process is fraught. I would never suggest you neglect technical vetting in favor of experience. It's easy to get burned by someone whose personality you like and who sounds smart, only to find out they don't have the basic skills to succeed. And even with the best interviewers in the world, it's always possible that you will still hire someone that chokes at the plate. In the case where you find yourself with a player that isn't shipping, you have to take to heart the other thing that you can do more easily than Google: get rid of underperformers. It sucks, and no one enjoys letting people go, but a startup cannot afford to keep people that aren't delivering.

Will this result in your team being full of A-list all stars? Maybe. Or you may find yourself with a team where few people are exceptional standouts, but everyone delivers. When everyone in your roster can reliably get on base, you're winning. You're shipping. And that's what the game is about.

Wednesday, June 20, 2012

Code Reviews, Code Stories

Code reviews are a tool that incites a lot of strong emotions from developers. Those who love them write tweets like

Code review is hard - both reviewing and being reviewed make people cranky. But I know of no better way to make great software. -- @yminsky

This tweet, even in its adoration, still captures some of what people hate about code reviews. They make people cranky, to put it mildly. I personally don't like code reviews very much. In my experience, they are just as likely used to bully, score points, or waste time on pedantic style notes as they are to produce great software. In fact, I think code reviews are absolutely necessary only in three kinds of situations:

1) You don't have good automated testing practices in place to catch most errors at or before checkin.
2) You work on a big project with a very geographically distant team of developers (most open source).
3) You work in a company of many developers and huge code bases where strict style conformity is very important.

In my company, we have one team that falls into slot number 1, and they do code reviews for all checkins. The rest of the team does ad-hoc code reviews and occasional code walkthroughs, and we rely on the heavy use of testing to ensure quality. In general I feel that this enables a good mix of code correctness and checkin velocity.

The one thing this misses, though, is the required opportunity for developers to learn from each other. I like pair programming, and encourage my developers to do it when they are first starting projects or figuring out tricky bugs. But a lot of the time my developers are working in parallel on different code bases, and I felt that we were missing that learning moment that code reviews and pairing can provide. So I started asking people in interviews how they encouraged their team to work together and learn from each other, and I got a great suggestion: encourage the developers to show off work that they're proud of to other team members. Thus the Power Hour was born.

"Power Hour" is an hour that we do every Friday morning with my team and our sister team of warehouse developers. In that hour every developer pairs up with someone else and they take turns sharing something they did that week. Show off something you're proud of, ask for help with a problem you're stuck on, or get feedback on the way something was or is being designed. The power and choice of what to show lies in the hands of the person showing it, which takes away the punitive nature of code reviews; all they are required to do is to share some code. Show off, get help/feedback, and then switch. It's not about code review, that line-by-line analysis and search for errors big and small. It's about code stories, whether they are war stories, success stories, or stories still being written.

There are a lot of features of code review that this Power Hour doesn't satisfy. It's not a quiet opportunity to evaluate someone else's code without them looking over your shoulder. It's driven by the person that wrote the code and they are encouraged to highlight the good parts and maybe even completely bypass the ugly parts. It's not possible for the other person to fully judge the quality of what they're seeing, because judging isn't the point at all. Power Hour is a learning and sharing opportunity, and it is explicitly not an occasion for judging.

This has turned out to be a smashing success. It's amazing how much you can teach and learn in a concentrated hour of working with another person. Last week I showed off a new Redis-based caching implementation I wrote, and got to see all the code and a demo of a new Shiro authentication and entitlement system. I caught a few things I still needed to clean up in the process of explaining my work, and so did my partner. If you, like me, shun formal code reviews, I encourage you to try out a weekly turn at code stories. It's a great way to build up teamwork, encourage excellence, and produce great software, without the crankiness.

Thursday, June 14, 2012

There Is No Number

Is it time to quit your longtime job? It's pretty easy to tell. Have you gone out and interviewed? Gotten job offers? Seriously considered them? Then you should quit. It's that simple. Not quitting now is delaying the inevitable and robbing yourself of the opportunity to grow.

I've seen this time and again, and gone through it myself. It's hard to leave a longtime job, especially when you are treated holistically well. Great perks, good pay, a big network, lots of friends. You may hate how things are done but you know how to get them done. It's scary to think about leaving that comfort zone. And when you bring up the possibility of leaving, either speculatively or actively trying to resign, it's likely that your boss will promise you up and down that they will move mountains for you. A new project! More responsibility! Leadership opportunities! Your pick of teams! Maybe even the biggest enticement out there: more money.

Take a lesson from Peggy Olson. For those Mad Men fans out there, you know that Don will never change. What made him a great boss for her, a mentor, a leader, was also keeping her boxed in. He wasn't changing, and she needed to grow. Your Don could be your actual manager, but more likely it's the company culture and dynamics that are beyond the ability of one person to change. Be real: will more money make you excited to get out of bed in the morning? Will it make you feel like your contributions are truly important? Will it make you feel like you're actually growing and learning new things?

Sometimes, there is no number.

If that isn't enough to get you moving, here's one more thing about leaving a long-time job, especially one at a big company: if you are leaving on good terms, there's a very good chance you can go back if things don't work out. A friend of mine recently did just that. After leaving her job at a larger tech company due to malaise with the work and a bad fit with her team, she went to a very small startup to do mobile development in a related genre. When she realized that the startup was just not managed well enough to succeed, she reached out to her old employer and was hired back, happier than ever, having both learned some new tricks and gotten into a new team.

Throw back your drink, and do the deed. It will probably feel like breaking up with a long-term partner, and it may be one of the most stressful things you've ever done, but it's also a good bet that you will end up better off in the long run. If nothing else, when the deed is done, I bet you will leave with a smile.

Saturday, June 9, 2012

Corporate culture: A lesson in unintended consequences

It's been a little more than six months since I left finance to join Rent the Runway. Upon hearing that I moved from a big investment bank to a small startup, many people comment on how different it must be. And it is, but in ways that you might not expect.

The day-to-day aspects of the job are not that different. On the technical front, I'm designing SOA-structured systems, thinking about building a messaging-based data bus, worrying about how to model business workflow into software and making sure we can process everything we need to process in time. I have more freedom to choose the technologies I want to use but I'm also more constrained by limited resources to manage the complexity of additional systems. The giant internal infrastructural support organization I had for hardware and level one support is replaced by a giant cloud vendor that does pretty much the same thing but with crappier hardware and faster turnaround. The people-related work is mostly unchanged. I still use the skills I learned for building consensus across teams, coaching people in their careers, and managing up, down, and laterally. I run plenty of meetings, make plenty of schedules, and continue my struggle balancing management and organization with architecture and coding. My schedule starts and ends slightly later in the day, but even my hours are not much different.

Unlike some that have left finance, I don't hold any major grudges against it. I learned a lot in my six plus years working there, about technology, leadership, and working with people. So given that the work is similar, why leave the better paycheck and general stability for the uncertainty of startup life? I'm much happier now than I was in my last years in finance, and the reason for that is not technical, but cultural. When I started in finance, the culture was surprisingly startup-like. It was more formal, but each team was dedicated to the business they served and we were able to work fairly autonomously in support of that business. Over time, though, the culture changed to become more centralized and thus bureaucratic. Instead of picking the right thing for your business area, you had to get approval from people that had nothing but a vested interest in seeing their pet technology promoted in order to get themselves promoted. That of course led to politically-driven technology which is an absolute nightmare for a person that cares deeply about building great systems.

Ironically, I believe this happened at least partly because the company was concerned with building a formal technical career path in order to promote and retain technical talent. At first, this resulted in senior technical people getting much-deserved recognition and it was great. I wanted nothing more than to reach that highest level myself. But soon the highest-level title ("technology fellow") had been awarded to the backlog of people that truly deserved it, and it became something that was expected to be awarded to more people every year; each group needed to have one, and it was a cheap way to recognize and retain people in years when salaries were stagnant since it didn't necessarily confer any increase in compensation. This growing group of people, many qualified by deep work in one small area, were now expected to help guide the technology decisions of the whole company. And so a bureaucracy was born. Decisions that would have been made via influence within a group and business area started to be dictated by committees far away from that group. The fellows started, unintentionally, to squash creativity of those below them on that same technical track they were supposed to be growing.

As bad as that was, in my opinion the biggest problem with this role was that it was not equivalent to the same level title for managers, "Managing Director". Because the role of fellow became more of an honorary recognition than an actual promotion, it came to pass that the savviest fellows would also get themselves promoted to Managing Director (MD), and thus become first among equals. No longer did the mere tech fellow have the strongest voice in the technical decision making; everyone started striving to become more than just a fellow, and in order to get that MD promotion they needed the backing of the other tech fellows that were also Managing Directors. As we all know, the best technologists are not necessarily the best politicians and decisions started being guided more and more by those better at politics than technology. In this way, an effort to promote and retain technologists ended up building a technical culture of decision making driven by politics and bureaucracy.

At the end of the day, finance is an industry that cares first about money (duh), and secondly about power, and that is true for every successful person that works there in any capacity. Technical qualifications will only ever be rewarded as they make or save money. If you don't find finance fascinating, or care about money above all other things, this cultural friction may grind you down. I've discovered the truth in the wisdom that you enjoy your job more when you're working on a product that you use and enjoy. Rent the Runway has a different goal than my old investment bank; yes, we want to make money, but we want to make it by delivering a great experience to our customer, an experience that I participate in and really love. And that is a goal that I know I can build a great technical culture around.

Thursday, May 31, 2012

War Stories: Guava, Ehcache, Garbage Collection

We're in the process of moving all of our major business logic out of our clunky Drupal frontend to Java backend services, and we took another big step down that road this week by moving all of the logic for filtering of our product grids to our new integration service. This release was the culmination of months of work and planning that started at the beginning of the year, and it gets us over a major functionality hump. The results are looking good: we've saved almost 3s average page load time for this feature. Yes, that's right, three seconds per page load.

As you may guess from the title of this post, the release was not entirely smooth for our infrastructure team. The functionality got out successfully, but two hours after we released we started noticing slowness on the pages, and a quick audit showed frequent full GCs on the services. Some rogue caching was being exercised much more than we had seen during load testing. After some scrambling, we resized the machines and restarted the VMs with more memory. Fortunately the cache would only get so big, and we could quickly throw more memory onto the machines (thank the cloud!). Crisis averted, we set to fixing the caching so that we wouldn't hit slow FGCs.

The fix seemed fairly straightforward; take the cache, which was originally caching parameters mapped to objects, and instead just cache the object primary ids. So the project lead coded up the fix, and we pushed it out.

Here's the fix. Notice anything wrong? I didn't. We're big fans of Guava and use List transformers all over the place in our code base. So we load test that again, and it looks ok for what our load tests are minimally worth, so we push it onto one of our prod boxes and give it a spin.

At first, it seemed just fine. It hummed along, seeming to take less memory, but slowly and surely the heap grew and grew, and the garbage collector ran more and more. We took it out of the load balancer, forced a full GC, and it still had over 600 MB of active heap memory. What was going on?

I finally took a heap dump and put the damned thing into MAT. Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise, we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element, and saw something strange. Instead of the cache element containing a string key and a list of strings as the value, it contained some other object. And inside that object was another list, and a reference to something called "function", that pointed to our base class.

As it turns out, Lists.transform is a lazy function. Instead of applying the transformer to the list immediately and returning the results, you get back an object that acts like a list but applies the transform to each element only when you retrieve it. That object holds on to the original list and the function, which is exactly what I was seeing in the heap dump. This is great for saving a bit of time up front, but absolutely terrible if you're caching the result to save yourself memory, because every cached "list of ids" keeps the full original objects alive. Now, to be fair, Guava tells you that this is lazy in the javadoc, but not until you get to the third part of the doc, and we are even lazier than Guava in our evaluation.

So, instead of caching the list as it is returned from Lists.transform, we call Lists.newArrayList on the result and cache that. Finally, problem solved.
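To make the trap concrete, here is a minimal sketch in plain Java. It deliberately does not use Guava: lazyTransform is a hypothetical stand-in that mimics the behavior described above (a list view that holds the source list and the function), and the Product class and the calls counter are made up for illustration. The last step, copying the view into a fresh ArrayList, plays the role of the Lists.newArrayList call.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustration only: a hand-rolled lazy view, not Guava's implementation.
class LazyViewDemo {
    // Counts how often the transform function runs.
    static int calls = 0;

    // Hypothetical heavyweight object that we do NOT want kept in the cache.
    static final class Product {
        final String id;
        final byte[] payload = new byte[1024];
        Product(String id) { this.id = id; }
    }

    // Returns a view that keeps references to both the source list and the
    // function, and applies the function on every get().
    static <F, T> List<T> lazyTransform(final List<F> source, final Function<F, T> fn) {
        return new AbstractList<T>() {
            @Override
            public T get(int index) { return fn.apply(source.get(index)); }

            @Override
            public int size() { return source.size(); }
        };
    }

    public static void main(String[] args) {
        List<Product> products = Arrays.asList(new Product("a"), new Product("b"));
        List<String> view = lazyTransform(products, p -> { calls++; return p.id; });

        // Creating the view applies nothing yet.
        System.out.println("calls after creating the view: " + calls); // 0

        // Each read re-applies the function; nothing is remembered.
        view.get(0);
        view.get(0);
        System.out.println("calls after two reads: " + calls); // 2

        // Caching `view` would keep `products` reachable. Copying yields a
        // plain list of ids with no link back to the Product objects.
        List<String> safeToCache = new ArrayList<>(view);
        System.out.println("copied ids: " + safeToCache); // [a, b]
    }
}
```

The copy costs one pass over the list up front, which is a cheap price when the whole point of the change is to keep only the ids in the cache.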

The best part of this exercise was teaching other developers and our ops folks about the JVM monitoring tools I've mentioned before; without jstat -gc and jmap I would have been hard-pressed to diagnose and fix this problem as quickly as I did. Now at least one other member of my team understands some of the fundamentals of the garbage collector, and we've learned a hard lesson about Guava and caching that we won't soon forget.

Sunday, May 27, 2012

Hammers and Nails: Managing Complexity

I've been thinking a lot about complexity lately. As a systems developer at heart, I love complexity. I love complex tools that have enormous power when wielded properly. In my last position I designed systems to provide infrastructure software services to thousands of developers and systems running around the globe, and I enjoyed the process of finding the best tools for that job no matter how esoteric. I prided myself on looking beyond good enough, and was rewarded for building things that would last for years, even if they took years to build.

Now that I work at a small startup, I have begun to view complexity in a very different way. I'm rebuilding critical business infrastructure with at most a couple of developers per project and an ops team of two for all of our systems. In this scenario, while we have the freedom to choose whatever stack we want to use, for every component we choose we have to weigh complexity and power against the cost of developer time and operational overhead.

Take Play as an example of a framework whose simplicity was a major selling point. While we ended up scrapping it (and I would advise against using the 1.X branch in production), I do not think that trying it in the first place was a bad idea given what we knew at the time. Using Play, all of our developers were able to get projects created, working against the data stores, and tested in very little time. I still have developers ask me why they can't just start up new projects in Play instead of having to use one of the other Java frameworks that we've moved forward with, even though the new frameworks add only a little more complexity. I love Dropwizard to death, and find it to be a good replacement for Play, but it doesn't support JPA out of the box yet, it requires just a bit more thought to get a new project up and running, and even this minimal additional complexity is enough to slow everyone down a noticeable amount on new projects. Every bit of thought we have to put out there is mental overhead that takes away from delivery velocity.

Another painful example of unexpected complexity happened this past week. We are moving our stack towards a service-oriented (SOA) model, and as part of that move we have load balancers set up in front of our various services. These load balancers are provided by our cloud hardware vendor, and have a hard limit of 30s per request. Any request that runs longer than 30s will be killed by the load balancer, which then sends an unspecified 500 to our storefront. We have a best practice that our staging environment should be run in the same way as our production environment, but when it comes to setting up load balancers we often forget about this policy, and we released a new major service migration that called some very heavyweight db queries now behind a layer of load balancers. So, we do the release, and immediately test the most intensive requests that could be run. Some of these requests fail, with "unexpected" 500s. Surprise surprise, we forgot about the load balancer, and not only that, but everyone that was doing the release forgot about the sniping and was mystified by the behavior. The load balancer is just a tiny bit of added complexity, but it was enough to scuttle a release and waste hours of development time.

All this thinking about complexity has come to a head lately as I have been pondering our current storage platforms and the choices I have made in what to use. I have chosen, in this case, to minimize upfront complexity, and I've chosen to go with MongoDB. Moreover, I believe that, of all the NoSQL stores that I have personal experience with out there (HBase, Cassandra, MongoDB), MongoDB is the most likely to be widely successful over the long term. Why? It's not the most powerful solution, it's probably not the most performant solution, and it currently has some quirks that have caused many people grief. But the CTO of 10Gen, Eliot Horowitz, is laser-focused on creating a system that is easy for developers to use and reason about. His philosophy is that developer time is the biggest fixed cost in most organizations, and I think from small startups to big companies that is generally true. Why do most companies use SQL-based RDBMS systems? They often have some serious limitations and challenges around scalability, and yet they are the first thing most people turn to when looking for a data storage solution. But you can pick almost any developer up off the street and they will be able to find tools to be productive in a SQL-based environment, you can hire an ops person to maintain it, and you can find answers to all your questions easily without having to fully understand the implications of the CAP theorem. And so it goes with Mongo.

You can manage your developer dollar spend in lots of ways. If you are a big company, you can afford to hire a small, dedicated team to manage certain elements of complexity. You can simply try to hire only the absolute best developers, pay them top dollar, and expect them to learn whatever you throw at them. And you can do your best to architect systems that balance complexity and power tradeoffs with developer complexity overhead. The last is not easy to do. Simplicity often comes with a hidden price tag (as we found with Play 1.X), and it's hard to know when you're buying in to hype or speeding towards a brick wall. Of course, not every problem is a nail. But, as a startup, you probably can't outspend your problems, so before you buy the perfect tool to solve your next issue, make sure you can't at least stun it for a while with a good whack.

Thursday, May 17, 2012

Process Debt and Team Scalability

Most of us in tech spend time thinking about technical debt. Whether it's a short-term loan or a massive mortgage, we've all seen the benefits and costs of managing and planning a code base with technical debt, and we are generally familiar with ways to identify, measure and eliminate this debt.

What about process debt? Process debt is a lot like technical debt, in that it can be a hack or a lack. The lack of process as debt is pretty easy to see. For example, never bothering to create an automated continuous build for your project. That's a fine piece of process to avoid when you have only one or two developers working on a code base, but as your team grows, your lack of process will start to take its toll in broken code and wasted developer time.

There are also plenty of ugly process hacks. Take for example the problem of a team of varying skill levels, and a code base that is increasingly polluted by poorly formatted, totally unreadable code. When faced with this problem you may be tempted to institute a rule where every line of code must be reviewed by a teammate. Later you discover that the two people with the worst style are reviewing each other, and things still aren't improving enough. So you declare that everyone must have their code reviewed by one of the senior developers that you have appointed. Now you have better style, but at the cost of everyone's productivity, especially your senior developers.

Adding code reviews was the easiest path to take. You just hacked in some process instead of paying up front to think about what the best process would be. But process, once added, is hard to remove. This is especially true for process intended to alleviate risk. Maybe you realize after a few months that the folks with questionable style have learned the ropes, and you remove the rule. But what if you added code reviews not to fix a style problem, but to act as a risk mitigant? I've done this myself in the case of a code base with very little automated testing and a string of bad releases. I'm now paying for my technical debt (a lack of automated testing) by taking out a loan on process.

An easy way to identify process debt is to ask yourself the following question: Will this solution still work when I have twice as many developers? What about ten times as many? A hundred? Process debt, like technical debt, makes scaling very difficult. If every decision requires several meetings and a sign-off, if every git pull is a roll of the dice for your continued productivity, or if you have to hire three project managers for every five developers just to keep track of the task list, you're drowning yourself in process debt.

Developers are wary of process because we too often experience process debt of the hack sort. I believe that both a lack of process and an excess of process can cause problems, but if we begin to look at process debt with the same awareness and calculations that we apply to technical debt, we can refactor our playbook the way we refactor our code and end up with something functional, simple, and maybe even elegant.

Sunday, May 13, 2012

Budgeting for Error

What's your uptime SLA over the last month, six months, year? Do you know off the top of your head? Is your kneejerk response, "as close to 100% as possible"? Consider this: by not knowing your true current SLA, you not only turn a blind eye to a critical success metric for your systems, you also remove the ability to budget within the margins of that metric.

There are about 8765 hours in a year. How many of those hours do you believe your code absolutely needs to be up to keep your business successful? 99% of the time buys you almost 88 hours of downtime over the course of a year. 99.9% of the time still buys you almost 9 hours. Even 99.99% gives you about 52 minutes a year that your systems can be down. Think of what you can do with these minutes. Note that the much-vaunted "5 9s" reliability (99.999%) breaks down to about 5 minutes over the course of a year, which is great if you're Google or the phone company, but probably not a smart goal for your average startup.
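The arithmetic behind those numbers is simple enough to keep in a scratch script. This is a minimal sketch; the function name `downtime_budget_hours` is mine, and it assumes the same rough figure of 8765 hours in a year.

```python
HOURS_PER_YEAR = 8765  # the approximate figure used above


def downtime_budget_hours(availability_percent, hours=HOURS_PER_YEAR):
    """Hours of allowed downtime per year at a given availability target."""
    return hours * (1 - availability_percent / 100)


for target in (99, 99.9, 99.99, 99.999):
    budget = downtime_budget_hours(target)
    print(f"{target}% uptime allows {budget:.2f} hours "
          f"({budget * 60:.1f} minutes) of downtime per year")
```

Passing a different value for `hours` gives the budget for a month or a quarter, which is often the more useful horizon when you are deciding whether a risky release fits.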

Let's say you know that your deployment process is rock-solid without outage and you will never need planned hardware downtime due to the way you've architected your systems. But you also know that you have some risky features that you want to push now, before you announce a critical partnership that should result in a big membership bump. If you're sure that the bump won't cause downtime, you might choose to push the features and risk some downtime in smoothing out rough edges on the code so that you have a really compelling site for those new members.

On the other hand, if you know that you're pretty solid under your current load but a 50% increase in usage has the potential for some degree of system failure, your error budget might not accommodate both the risky new features and the membership growth. And if your business pushes you to do both the risky new features and the growth risk? Make sure they know that your SLA may suffer as a consequence. When you know your goal SLA, and you know something is likely to reduce or violate it, that's a strong signal that you should think carefully about the risks of the project. This can also be a useful negotiation tool when being pushed to implement a feature you don't think is ready for prime time. When they say we need to release this new feature today, which means at least two hours of downtime that pushes you out of SLA, it becomes their job to get authorization from the CTO instead of your job to convince them why it is a bad idea.

I will admit that I do not currently have an uptime SLA for my services. Up until recently, it never occurred to me that there would be any value to trying to pin down a number and measure to it. As a result, while liveness and stability are always a consideration, I haven't taken the time to think through the rest of the year when it comes to hardware upgrades, new features, or deployment risk as measured by likely downtime impact. I'm missing out on a key success metric for my infrastructure.

Once I've managed to nail down a coarse-grained uptime SLA for my systems, the next phase of this work is to nail down a more fine-grained response time SLA. Of 100 requests, what is the 95th percentile response time from my infrastructure services? This is much trickier than a simple uptime SLA due to the interaction of multiple systems each with their own SLAs. For now though, I need to focus on the big picture.
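For a single service, the 95th percentile itself is easy to compute once you have the timings. Here is a minimal sketch using the nearest-rank definition, which is one of several common ways to define a percentile; the sample numbers are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds for 100 requests:
# 94 fast ones, 5 slow ones, and one very slow outlier.
response_times_ms = [20] * 94 + [250] * 5 + [4000]

p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
```

With this data the median says everything is fast, while the 95th percentile lands among the slow requests, which is exactly why the percentile is the more honest number to put in an SLA.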

Wednesday, May 2, 2012

Intuition, Effort, and Debugging Distributed Systems

I recently watched this great talk by Coda Hale, "The Programming Ape". It's heavily influenced by Thinking, Fast and Slow, a book about cognitive processes and biases. One of the major points of the book, and the talk, is that we have two types of thinking: intuitive thinking, which is fast, easy, creative, and sloppy, and attention-based thinking, which is harder, but more accurate.

One of the great points that Coda makes in his talk is that most of the ways we do things in software development are very attention-heavy. At the most basic level, writing correct code requires a level of sustained attention that none of us possesses 100% of the time, which is why testing (particularly automated testing) is such an essential part of quality software development. Attention doesn't stop when you get the code into production; you still have the problem of monitoring, which often comes in the form of inscrutable charts or messages that take a lot of thought to parse. Automation helps here, but as anyone that has ever silenced a Nagios alert like a too-early alarm clock knows, the current state of automation has limits when it hits up against our attention.

By far the most attention-straining thing I do on a regular basis is debugging distributed systems. Debugging anything is a very attention-heavy process; even if you have good intuition about where the problem may lie, you still have to read the code, possibly step through it in a debugger or read through a log output and try to find the error. Debugging errors in the interaction between distributed systems is several times more difficult. A debugger is often of no help, at least not initially, because you have to get a series of events to happen in a particular way to trigger the bug. Identifying that series of events in most cases requires staring hard at a series of log files and/or system state dumps, and trying to piece together the ordering based on timestamps that may slightly differ between systems. I consider myself to be a very good debugger and it still took me a solid 4 hours of deep concentration, searching through and replaying transaction logs before I was able to crack through this particular bug. I would never hold the ZooKeeper code base up as a paragon of debuggability, but what can we do to make this easier?

When you're writing a distributed system, think hard about what you log. This may be impossible to always get right, but so often the only way you have to find that bug is log files from around the time it happened. If you're going to reconstruct a series of events, you need evidence that those events happened, and you need to know when they happened. Should you rely on the clocks of the machines to line up enough to put the time series together, and should you fail the system if the clocks are too far apart? Since it's a distributed system, is there a way for all of the members of the system to agree on a clock that you could use for logging? As for the events themselves, it is important to be able to easily identify them, their particular behavior, and the state they are associated with (the session that made this request, for example).
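One well-known answer to the "agree on a clock" question is a logical clock in the style of Lamport timestamps. I'm naming that technique as an illustration; it is not a description of how ZooKeeper logs. The sketch below assumes two hypothetical nodes and made-up session names, and shows how stamping every log record with a logical time lets you merge the logs into an order that respects message causality even when wall clocks disagree.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance for a local event or just before sending a message."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """Merge the sender's timestamp when a message arrives."""
        self.time = max(self.time, sender_time) + 1
        return self.time


log = []  # (logical_time, node, session, event) records from all nodes


def record(logical_time, node, session, event):
    log.append((logical_time, node, session, event))


node_a, node_b = LamportClock(), LamportClock()

record(node_a.tick(), "node-a", "session-1", "create")
sent_at = node_a.tick()  # timestamp travels with the message
record(sent_at, "node-a", "session-1", "send-proposal")
record(node_b.tick(), "node-b", "session-2", "setData")
record(node_b.receive(sent_at), "node-b", "session-1", "recv-proposal")

# Sorting by logical time (ties broken by node name) never puts a
# receive before the send that caused it.
ordered = sorted(log)
```

The tradeoff is that logical time tells you nothing about how long anything took, so in practice you would probably log both the logical timestamp and the local wall-clock time.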

One of the problems with ZooKeeper logs, for example, is that they don't do a great job of highlighting important events and state changes. Look at this, does it make your eyes glaze over immediately?

Events are hidden towards the ends of lines, in the middle of output (type:setData, type:create). Important identifiers are held in long hex strings like 0x773516a5076a0000, and it's hard to remember which server/connection they are associated with. To debug problems I have to rely on pen/paper or notepad records of what session id goes with what machine and what the actual series of events was on each of the quorum peers. Very little is scannable and it makes debugging errors a very tedious and attention-heavy process.

Ideally, we want to partially automate debugging. To do this, the logs have to be written in a form that an automated system could parse and reason about. Perhaps we should log everything as JSON. There's a tradeoff, though: now a human debugger probably needs another tool to parse the log files at all. This might not be a bad thing. Insisting on basic text for logging leaves out the huge potential win of formatting that can draw the eye to important information in ways other than just text.
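As a rough sketch of what JSON logging could buy us, assume every event is written as one JSON object per line with the session, node, and event type as top-level fields. The field names, node names, timestamps, and the second session id below are all hypothetical; a plain Python list stands in for the log file.

```python
import json


def log_event(lines, *, timestamp, node, session, event, **details):
    """Append one event as a single JSON object per line."""
    entry = {"ts": timestamp, "node": node, "session": session,
             "event": event, **details}
    lines.append(json.dumps(entry, sort_keys=True))


def events_for_session(lines, session):
    """Parse JSON lines and keep only one session's events, in time order."""
    parsed = (json.loads(line) for line in lines)
    matching = [entry for entry in parsed if entry["session"] == session]
    return sorted(matching, key=lambda entry: entry["ts"])


lines = []  # stands in for a log file shared by the quorum peers
log_event(lines, timestamp=1003, node="peer-2",
          session="0x773516a5076a0000", event="setData", path="/config")
log_event(lines, timestamp=1001, node="peer-1",
          session="0x773516a5076a0000", event="create", path="/config")
log_event(lines, timestamp=1002, node="peer-1",
          session="0xaaaa", event="create", path="/other")

history = events_for_session(lines, "0x773516a5076a0000")
```

The point is that "which machine did what for this session, and in what order" becomes a three-line query instead of a page of pen-and-paper notes.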

Are there tools out there now to aid in distributed debugging? A quick Google search shows several scholarly papers and not much else, but I would guess that given the ever-increasing growth of distributed systems we'll see some real products in this area soon. In the meantime, we're stuck with our eyes and our attention, so we might as well think ahead about how we can work with our intuitive systems instead of against them.

Thursday, April 26, 2012

Scaling in the Small

Last week I wrote about scaling and some examples of recent scaling successes. That post has been in my mind this week as I've been thinking about what I need to be doing here at Rent the Runway to enable our systems to scale.

Unlike the developers at Instagram or Draw Something, I'm not coming to this problem from a fresh code base with a very small, focused team. Instead, I'm taking a system that has been in production for almost two and a half years and trying to correct a series of short-sighted decisions made over that time, all while supporting a thriving business with more feature requests and ideas than a team of a hundred devs could handle. The problem is in some ways smaller than your typical scaling story: I don't have to handle anywhere near the volume of a popular app. But I can't afford to make changes that would bring the business to a halt for hours, lose orders, or block our warehouse. So we replace the jet engine one bolt at a time, all while flying over the Atlantic Ocean.

Our scaling problems arise mostly from the decision, in the initial website implementation, to create our site in Drupal and have most of the business logic live in PHP code that resides in that Drupal server. This was a bad choice. Drupal is not an e-commerce platform, and using it as such is a nightmare. The business logic, entangled with the view logic, is not easy to test, and every change to that code requires a re-release of the whole site. Drupal forced certain database modeling decisions that are almost impossible to reason about, and the end result of all of this is a rat king of PHP, SQL, and CSS/JS. Even if we could scale our site by throwing money at database servers, we can't scale our feature velocity in such a system.

So the first scaling problem I've been tackling in my time here has not been the ability to handle more users on the site. It's been the ability for our developers to actually deliver more features in a timely manner without nightmarish releases. This effort was started by my boss before I joined, and eight months later, we're really starting to see the fruits of our labor.

Our first attempt was to pull heavy logic behind our reservation system out into a Java service. In the process, we learned exactly how daunting the task ahead of us would be. Have you ever been tempted to say, this won't be too hard to rewrite, only to eat your words? We ate our words several times over in this project. From the migration and reconciliation of old data, to the performance of the new logic, to the integration with our warehouse code, we got hammered over and over. This was a system that had to be replaced all at once, with no opportunity for iterative releases. The new Java service, while not perfect, was fairly straightforward to write, test, and debug. But the PHP code that had to change to use this service was untested, and we made the fatally stupid mistake of letting it remain so throughout the development process. The project ended up taking several months and almost the entire team to complete.

Learning our lesson from that, we've moved to some pieces of the system that we can replace iteratively: storing more of our product metadata inside of MongoDB and providing a service layer to serve new data while slowly moving existing functionality onto that system. This has been much easier than the first project, launching three new features over three weeks all while improving site performance and sparking a creative revival among the developers. When you have fast and easy access to your data, and you can release whenever you feel good about your testing, you turn your feature scaling from being throttled by your platform to being throttled by your own creativity and developer hours.

So today, I gave our developers a talk about the long-term (9-18 month) goals for our system design. It's straightforward: continue to move business logic out of Drupal, write services that are horizontally scalable, use some judicious vertical scaling where needed. Cache read-only data as much as possible, think about sharding points, plan for more users who stay longer and interact more heavily. And as we think about user growth in addition to feature growth, don't forget to add performance testing to our bag of tricks. It's great to be here, and to recognize the scaling problems we have already conquered to get to this point.

Wednesday, April 18, 2012

Scaling: It's Not What It Used To Be

We've seen a lot of interesting stories lately about companies scaling massively from just a few users to millions in a very short period of time. Contrary to expectation, some of them weren't even initially designed for scaling. Take, for example, Instagram (a fabulous slide deck, btw). Their first strategy for scaling was vertical: buy a bigger database server. That didn't last too long, and they found themselves doing horizontal scaling (via sharding) after a few months. But instead of writing their own custom database solution, or using one of the many NoSQL options out there, they stuck with Postgres and via a clever virtual sharding strategy combined with Redis and Memcached, they managed to scale.
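For anyone who hasn't seen virtual sharding before, here is a generic sketch of the idea, not Instagram's actual implementation. Keys are hashed into a large, fixed number of logical shards, and a small table maps ranges of logical shards to physical servers. The shard count, server names, and function names are all assumptions made for the example.

```python
import hashlib

NUM_LOGICAL_SHARDS = 4096  # hypothetical; fixed for the life of the system

# Hypothetical mapping of logical shard ranges to physical servers.
# Growing means editing this table and moving whole logical shards.
PHYSICAL_SERVERS = {
    range(0, 2048): "db-a",
    range(2048, 4096): "db-b",
}

# The same table after adding a third server: only the logical shards
# in 1024..2047 have to move, and no key is ever rehashed.
REBALANCED_SERVERS = {
    range(0, 1024): "db-a",
    range(1024, 2048): "db-c",
    range(2048, 4096): "db-b",
}


def logical_shard(user_id):
    """Stable hash of the key, reduced to a logical shard number."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_LOGICAL_SHARDS


def physical_server(user_id, servers=PHYSICAL_SERVERS):
    """Look up which physical server currently owns the key's logical shard."""
    shard = logical_shard(user_id)
    for shard_range, server in servers.items():
        if shard in shard_range:
            return server
    raise LookupError(f"no server owns logical shard {shard}")
```

Because a key's logical shard never changes, adding hardware is a matter of copying a slice of logical shards to the new machine and updating the table, rather than reshuffling every row.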

Then there's Draw Something. While the engineers designed it to scale, they had no idea that it would hit hundreds of drawings per second in its first few weeks. In contrast to Instagram, they ended up rewriting their backend data store in Membase during their growth spurt, but that combined with sharding and adding hardware (both vertically via more RAM and SSDs, and horizontally via more machines) got them through their growth period.

10 years ago, scaling to millions of users would have probably taken a large, smart team with months of advance planning to achieve. As a point of reference, LiveJournal had about 1.5 million active users per month in 2005, a very popular site at the time, and memcached was created to handle its load. I'm sure that the developers at Draw Something, Instagram and other popular tech startups are very talented, but it's more than talent that is allowing this kind of unplanned growth to happen. The common points between these two teams are not what I was expecting when I sat down to write this post. So what are they?

1. Redis. I was at a NoSQL meetup last night when someone asked "if you could put a million dollars behind one of the solutions presented here tonight, which one would you choose?" And the answer that one of the participants gave was "None of the above. I would choose Redis. Everyone uses one of these products and Redis."
2. Nginx. Your ops team probably already loves it. It's simple, it scales fabulously, and you don't have to be a programmer to understand how to run it.
3. HAProxy. Because if you're going to have hundreds or thousands of servers, you'd better have good load balancing.
4. Memcached. Redis can act as a cache but using a real caching product for such a purpose is probably a better call.
And finally:
5. Cloud hardware. Imagine trying to grow out to millions of users if you had to buy, install, and admin every piece of hardware you would need to do such a thing.

Redis, Nginx, HAProxy, Memcached. All free open-source software. AWS and other cloud vendors let any of us grab machines to run our software at the touch of a button. The barriers to entry have gone way down, and we've all discovered that scaling, while hard, isn't nearly as much of a monster as we made it out to be. It's truly an exciting time to be a software engineer.

Thursday, April 12, 2012

Debug Your Career: Ask for Advice

One of the most powerful relationship-building tools I have found in my career is the simple act of asking for advice. It's also something that I don't see a lot of people take advantage of, especially in tech. In some cases, it's an ego thing. What if they think I'm weak for not knowing, for not figuring this out on my own? Sometimes, it's a desire not to be a bother to the other person. Sometimes it just doesn't occur to you to seek outside help. But when you don't ever ask others for real advice (more than simply, "how does this code work anyway?", or, "what was that cool hack you cooked up the other day?") you leave a lot of potential on the table.

It shows you respect the other person's opinion (and thus, the other person)
We all want to believe that our opinion is important, and when you are actually asked for your opinion it is a token of respect for you as a person. You've heard that people like those that like them back? People tend to respect those that respect them back. After all, you have such good judgement in respecting them!
This really comes into play when you ask people that work for you or are junior to you for their advice, especially when you follow it or use it to open a dialogue on a bigger topic. As a technical leader this will usually come in the form of advice about technology, and let's be honest: none of us has the time to be a true expert in all of the technologies we use in our job. Why would you hoard the decision making on all technical matters? Your employees probably come to you with suggestions and opinions already, but taking care to specifically ask them for their advice is an important way to ensure they know that you respect them and the knowledge they possess.

It makes them a partner in your success
Think about a time when you've helped someone solve a particularly hard bug they were facing. A bug that was meaty and fun and made you think and also made you feel awesome and smart for being involved in solving it. Afterwards, did you feel some ownership stake in the success of that project? This is the feeling you give people when you ask them for their advice, especially when you take it and turn it into success.
Now think about a time when you've hit a bug in someone else's code and they have ignored your attempts to help them debug it, or worse, totally ignored your patches and pull requests. You probably feel a bit more animosity towards the project, and if it fails, perhaps it deserves to fail.
This set of feelings is analogous to proactively going to your manager and asking for advice on how to grow in an area, versus having to be told by your manager that you are failing. In the first scenario, you make your manager a partner in your success, and they feel pride and a sense of ownership when you succeed. In the second case, you are at best an area for improvement and at worst a burden or even a write-off.

It forces you to think about the problem you really have
You shouldn't ask someone for advice without knowing what you're looking for. Just as you're not likely to get much help debugging if all you know is that the program crashes, simply asking someone "Hey, any advice?" will probably get you little better than a cliche like... "ask for advice". Really valuable advice doesn't come from emailing Kevin Systrom and asking how you can get a billion dollars by putting your Hadoop cluster in a Cray-1. It's delivered when you tell your boss you're struggling with the challenge of managing a team and writing code at the same time, and how does he manage to keep an eye on everyone without micromanaging or spending every waking hour watching GitHub and Jira?

Of course, this comes with some stipulations. The act loses its power if you're just doing it for the sake of doing it. You have to be willing to at least consider taking the advice you are given, and you don't want to waste people's time. But if you take the time to truly think about areas where you'd like to improve, and go to people honestly asking for advice on how to do so, you may find that an extra set of eyeballs makes the bugs in your career shallow.
