Buy My Book, "The Manager's Path," Available March 2017!

Thursday, April 26, 2012

Scaling in the Small

Last week I wrote about scaling and some examples of recent scaling successes. That post has been in my mind this week as I've been thinking about what I need to be doing here at Rent the Runway to enable our systems to scale.

Unlike the developers at Instagram or Draw Something, I'm not coming to this problem from a fresh code base with a very small, focused team. Instead, I'm taking a system that has been in production for almost two and a half years and trying to correct a series of short-sighted decisions made over that time, all while supporting a thriving business with more feature requests and ideas than a team of a hundred devs could handle. The problem is in some ways smaller than your typical scaling story: I don't have to handle anywhere near the volume of a popular app. But I can't afford to make changes that would bring the business to a halt for hours, lose orders, or block our warehouse. So we replace the jet engine one bolt at a time, all while flying over the Atlantic Ocean.

Our scaling problems arise mostly from the decision, in the initial website implementation, to create our site in Drupal and have most of the business logic live in PHP code that resides in that Drupal server. This was a bad choice. Drupal is not an e-commerce platform, and using it as such is a nightmare. The business logic, entangled with the view logic, is not easy to test, and every change to that code requires a re-release of the whole site. Drupal forced certain database modeling decisions that are almost impossible to reason about, and the end result of all of this is a rat king of PHP, SQL, and CSS/JS. Even if we could scale our site by throwing money at database servers, we can't scale our feature velocity in such a system.

So the first scaling problem I've been tackling in my time here has not been the ability to handle more users on the site. It's been the ability for our developers to actually deliver more features in a timely manner without nightmarish releases. This effort was started by my boss before I joined, and eight months later, we're really starting to see the fruits of our labor.

Our first attempt was to pull heavy logic behind our reservation system out into a Java service. In the process, we learned exactly how daunting the task ahead of us would be. Have you ever been tempted to say, this won't be too hard to rewrite, only to eat your words? We ate our words several times over in this project. From the migration and reconciliation of old data, to the performance of the new logic, to the integration with our warehouse code, we got hammered over and over. This was a system that had to be replaced all at once, with no opportunity for iterative releases. The new Java service, while not perfect, was fairly straight-forward to write, test, and debug. But the PHP code that had to change to use this service was untested, and we made the fatally stupid mistake of letting it remain so throughout the development process. The project ended up taking several months and almost the entire team to complete.

Learning our lesson from that, we've moved to some pieces of the system that we can replace iteratively: storing more of our product metadata inside of MongoDb and providing a service layer to serve new data while slowly moving existing functionality onto that system. This has been much easier than the first project, launching three new features over three weeks all while improving site performance and sparking a creative revival among the developers. When you have fast and easy access to your data, and you can release whenever you feel good about your testing, you turn your feature scaling from being throttled by your platform to being throttled by your own creativity and developer hours.

So today, I gave our developers a talk about the long-term (9-18 month) goals for our system design. It's straightforward: continue to move business logic out of Drupal, write services that are horizontally scalable, use some judicious vertical scaling where needed. Cache read-only data as much as possible, think about sharding points, plan for more users who stay longer and interact more heavily. And as we think about user growth in addition to feature growth, don't forget to add performance testing to our bag of tricks. It's great to be here, and to recognize the scaling problems we have already conquered to get to this point.

Wednesday, April 18, 2012

Scaling: It's Not What It Used To Be

We've seen a lot of interesting stories lately about companies scaling massively from just a few users to millions in a very short period of time. Contrary to expectation, some of them weren't even initally designed for scaling. Take, for example, Instagram (a fabulous slide deck, btw). Their first strategy for scaling was vertical: buy a bigger database server. That didn't last too long, and they found themselves doing horizontal scaling (via sharding) after a few months. But instead of writing their own custom database solution, or using one of the many NoSQL options out there, they stuck with Postgres and via a clever virtual sharding strategy combined with Redis and Memcached, they managed to scale.

Then there's Draw Something. While the engineers designed it to scale, they had no idea that it would hit hundreds of drawings per second in its first few weeks. In contrast to Instagram, they ended up rewriting their backend data store in Membase during their growth spurt, but that combined with sharding and adding hardware (both vertically via more RAM and SSDs, and horizontally via more machines) got them through their growth period.

10 years ago, scaling to millions of users would have probably taken a large, smart team with months of advance planning to achieve. As a point of reference, Livejournal had about 1.5 million active users per month in 2005, a very popular site at the time, and memcached was created to handle its load. I'm sure that the developers at Draw Something, Instagram and other popular tech start ups are very talented, but it's more than talent that is allow this kind of unplanned growth to happen. The common points between these two teams are not what I was expecting when I sat down to write this post. So what are they?

1. Redis. I was at a NoSQL meetup last night when someone asked "if you could put a million dollars behind one of the solutions presented here tonight, which one would you choose?" And the answer that one of the participants gave was "None of the above. I would choose Redis. Everyone uses one of these products and Redis."
2. Nginx. Your ops team probably already loves it. It's simple, it scales fabulously, and you don't have to be a programmer to understand how to run it.
3. HAProxy. Because if you're going to have hundreds or thousands of servers, you'd better have good load balancing.
4. Memcached. Redis can act as a cache but using a real caching product for such a purpose is probably a better call.
And finally:
5. Cloud hardware. Imagine trying to grow out to millions of users if you had to buy, install, and admin every piece of hardware you would need to do such a thing.

Redis, Nginx, HAProxy, Memcached. All free open-source software. AWS and other cloud vendors let any of us grab machines to run our software at the touch of a button. The barriers to entry have gone way down, and we've all discovered that scaling, while hard, isn't nearly as much of a monster as we made it out to be. It's truly an exciting time to be a software engineer.

Thursday, April 12, 2012

Debug Your Career: Ask for Advice

One of the most powerful relationship-building tools I have found in my career is the simple act of asking for advice. It's also something that I don't see a lot of people take advantage of, especially in tech. In some cases, it's an ego thing. What if they think I'm weak for not knowing, for not figuring this out on my own? Sometimes, it's a desire not to be a bother to the other person. Sometimes it just doesn't occur to you to seek outside help. But when you don't ever ask others for real advice (more than simply, "how does this code work anyway?", or, "what was that cool hack you cooked up the other day?") you leave a lot of potential on the table.

It shows you respect the other person's opinion (and thus, the other person)
We all want to believe that our opinion is important, and when you are actually asked for your opinion it is a token of respect for you as a person. You've heard that people like those that like them back? People tend to respect those that respect them back. After all, you have such good judgement in respecting them!
This really comes to play when you ask people that work for you or are junior to you for their advice, especially when you follow it or use it to open a dialogue on a bigger topic. As a technical leader this will usually come in the form of advice about technology, and let's be honest: none of us has the time to be a true expert in all of the technologies we use in our job. Why would you hoard the decision making on all technical matters? Your employees probably come to you with suggestions and opinions already, but taking care to specifically ask them for their advice is an important way to ensure they know that you respect them and the knowledge they posses.

It makes them a partner in your success
Think about a time when you've helped someone solve a particularly hard bug they were facing. A bug that was meaty and fun and made you think and also made you feel awesome and smart for being involved in solving it. Afterwards, did you feel some ownership stake in the success of that project? This is the feeling you give people when you ask them for their advice, especially when you take it and turn it into success.
Now think about a time when you've hit a bug in someone else's code and they have ignored your attempts to help them debug it, or worse, totally ignored your patches and pull requests. You probably feel a bit more animosity towards the project, and if it fails, perhaps it deserves to fail.
This set of feelings is analogous to proactively going to your manager and asking for advice on how to grown in an area, versus having to be told by your manager that you are failing. In the first scenario, you make your manager a partner in your success, and they feel pride and sense of ownership when you succeed. In the second case, you are at best an area for improvement and a worst a burden or even a write-off. 

It forces you to think about the problem you really have
You shouldn't ask someone for advice without knowing what you're looking for. Just as you're not likely to get much help debugging if all you know is that the program crashes, simply asking someone "Hey, any advice?" will probably get you little better than a cliche like... "ask for advice". Really valuable advice doesn't come from emailing Kevin Systrom and asking how you can get a billion dollars by putting your Hadoop cluster in a Cray-1. It's delivered when you tell your boss you're struggling with the challenge of managing a team and writing code at the same time, and how does he manage to keep an eye on everyone without micromanaging or spending every waking hour watching GitHub and Jira?

Of course, this comes with some stipulations. The act loses its power if you're just doing it for the sake of doing it. You have to be willing to at least consider taking the advice you are given, and you don't want to waste people's time. But if you take the time to truly think about areas where you'd like to improve, and go to people honestly asking for advice on how to do so, you may find that an extra set of eyeballs makes the bugs in your career shallow.

Thursday, April 5, 2012

Parallelism and the Limits of Languages

While we make heavy use of the core of Clojure, we don't use its concurrency primitives (atoms, refs, STM, etc.) because a function like pmap doesn't have enough fine grained control for our needs. We opt instead to build our own concurrency abstractions in Clojure on top of the outstanding java.util.concurrent package.

 Once upon a time, in the days when RAM was not quite as cheap as it is now, I worked on a team that had a giant in-memory cache of data that it used to do various financial calculations. This data had, with the growth of the financial markets it represented, grown to be too big to fit into the memory of a standard server without having GC absolutely wreck the process. Having exhausted all minor efforts to tune the data, and not wanting to buy an actual supercomputer, the team turned to Azul, who at that time sold boxes with enough memory to fit our process and a fairly nifty no-pause GC. Great! Except there was one catch: instead of having the super fast CPUs we were used to, it had hundreds of tiny little slow cores (with rather crappy fpus). And on these hundreds of tiny little cpus our code was very, very slow. So we set out to parallelize.

When you're forced into parallelism at that level of detail, you quickly realize how difficult it is to have a generic way of doing it. The fact that languages like scala and clojure will allow you to just say "apply this logic over the collection in parallel" is very cool, and a great way for developers to dip their toes in parallel processing of their collection data. But it breaks down quickly when you are talking about vastly different types of data running in vastly different types of computation. If you need to eke out every ounce of performance, you want every computation to be tuned to the right batch size and the right number of threads, and the size of the thread pool for task X may depend on whether it tends to run at the same time as task Y or task Z. Don't forget that task Z, if requested, must be prioritized above all else. The size of the batches themselves may depend on the type of object inside them, and how that affects the cache lines of the processor you're running on. Oh, and of course, you're not just tuning this for prod, it has to at least run to completion on your integration boxes and they are of course slightly different processors. And don't forget, your developers aren't writing code on Azul boxes, they're writing code on desktops with 2 cores each; threading even the test data to a pool of 50 doesn't make any sense, but they still need to run things in parallel or they'll never see the stupid threading bugs they added when they forgot to use concurrent data structures.

I've always thought that the next truly big leap in VMs and languages will be those that can do smart things with threading without making developers sweat too much. If we're really entering a world of more and more data over more but slower cores, we need a language and a VM that revolutionize our management of threads the way Java and the JVM have revolutionized our notion of memory management. The functional languages out there help, but you can't forget the underlying VM magic that will really make this possible. Look to the example of memory management: it's not Java itself that enables you to write code without worrying about deleting references only when they are truly not being used, but rather the VM that watches all objects and knows how to find the live ones and delete those that are no longer referenced, all without the developer needing to indicate a thing. For us to conquer the thread problem, we need a VM to observe the behavior of the system beyond just the number of cores and help us to appropriately occupy CPUs over collections of varying size and composition. Languages can force developers to think about applying logic over their data (and thus help with parallelism as a mindset), but they will never solve the problem of automated parallelism without the VM to tune it in the background.

There will always be people that need to tune their code down to exactly what instructions run on what core, just as there are still people that need to do their own memory management, but it shouldn't be a requirement for getting good mileage out of those cores. I can't wait for the day when calling submit feels as archaic as calling malloc.