Last week I wrote about scaling and some examples of recent scaling successes. That post has been in my mind this week as I've been thinking about what I need to be doing here at Rent the Runway to enable our systems to scale.
Unlike the developers at Instagram or Draw Something, I'm not coming to this problem from a fresh code base with a very small, focused team. Instead, I'm taking a system that has been in production for almost two and a half years and trying to correct a series of short-sighted decisions made over that time, all while supporting a thriving business with more feature requests and ideas than a team of a hundred devs could handle. The problem is in some ways smaller than your typical scaling story: I don't have to handle anywhere near the volume of a popular app. But I can't afford to make changes that would bring the business to a halt for hours, lose orders, or block our warehouse. So we replace the jet engine one bolt at a time, all while flying over the Atlantic Ocean.
Our scaling problems arise mostly from the decision, in the initial website implementation, to create our site in Drupal and have most of the business logic live in PHP code that resides in that Drupal server. This was a bad choice. Drupal is not an e-commerce platform, and using it as such is a nightmare. The business logic, entangled with the view logic, is not easy to test, and every change to that code requires a re-release of the whole site. Drupal forced certain database modeling decisions that are almost impossible to reason about, and the end result of all of this is a rat king of PHP, SQL, and CSS/JS. Even if we could scale our site by throwing money at database servers, we can't scale our feature velocity in such a system.
So the first scaling problem I've been tackling in my time here has not been the ability to handle more users on the site. It's been the ability for our developers to actually deliver more features in a timely manner without nightmarish releases. This effort was started by my boss before I joined, and eight months later, we're really starting to see the fruits of our labor.
Our first attempt was to pull the heavy logic behind our reservation system out into a Java service. In the process, we learned exactly how daunting the task ahead of us would be. Have you ever been tempted to say "this won't be too hard to rewrite," only to eat your words? We ate our words several times over in this project. From the migration and reconciliation of old data, to the performance of the new logic, to the integration with our warehouse code, we got hammered over and over. This was a system that had to be replaced all at once, with no opportunity for iterative releases. The new Java service, while not perfect, was fairly straightforward to write, test, and debug. But the PHP code that had to change to use this service was untested, and we made the fatally stupid mistake of letting it remain so throughout the development process. The project ended up taking several months and almost the entire team to complete.
Having learned our lesson from that, we've moved on to pieces of the system that we can replace iteratively: storing more of our product metadata in MongoDB and providing a service layer to serve new data while slowly moving existing functionality onto that system. This has been much easier than the first project. We launched three new features in three weeks, all while improving site performance and sparking a creative revival among the developers. When you have fast and easy access to your data, and you can release whenever you feel good about your testing, your feature scaling goes from being throttled by your platform to being throttled by your own creativity and developer hours.
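To make the iterative approach concrete, here is a minimal sketch of a service-layer read path that serves fields from a new store once they have been migrated and falls back to the legacy system otherwise. It is an illustration only, not the actual Rent the Runway code: the class name, the field names, and the dict-backed stand-ins for the legacy database and the new document store are all hypothetical, and it is written in Python for brevity even though the services described here are in Java.

```python
from typing import Callable, Optional, Set

# A lookup takes (product_id, field) and returns a value, or None if absent.
Lookup = Callable[[str, str], Optional[str]]


class ProductMetadataFacade:
    """Hypothetical read path: new store for migrated fields, legacy otherwise."""

    def __init__(self, legacy_lookup: Lookup, new_lookup: Lookup) -> None:
        self._legacy_lookup = legacy_lookup
        self._new_lookup = new_lookup
        self._migrated: Set[str] = set()

    def mark_migrated(self, field: str) -> None:
        """Switch one field over to the new store; callers are unaffected."""
        self._migrated.add(field)

    def get(self, product_id: str, field: str) -> Optional[str]:
        if field in self._migrated:
            value = self._new_lookup(product_id, field)
            if value is not None:
                return value
        # Not migrated yet, or missing from the new store: use the legacy path.
        return self._legacy_lookup(product_id, field)


# Stand-in data sources (assumptions for the example, not real schemas).
legacy_db = {
    ("dress-1", "designer"): "Legacy Designer",
    ("dress-1", "color"): "red",
}
new_store = {("dress-1", "designer"): "New Designer"}

facade = ProductMetadataFacade(
    legacy_lookup=lambda pid, field: legacy_db.get((pid, field)),
    new_lookup=lambda pid, field: new_store.get((pid, field)),
)

before = facade.get("dress-1", "designer")  # served by the legacy path
facade.mark_migrated("designer")
after = facade.get("dress-1", "designer")   # now served by the new store
```

Because each field is switched over independently, a migration like this can ship one small release at a time, and the fallback keeps reads working if a record has not been copied into the new store yet.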
So today, I gave our developers a talk about the long-term (9-18 month) goals for our system design. The plan is straightforward: continue to move business logic out of Drupal, write services that are horizontally scalable, and use some judicious vertical scaling where needed. Cache read-only data as much as possible, think about sharding points, and plan for more users who stay longer and interact more heavily. And as we think about user growth in addition to feature growth, we shouldn't forget to add performance testing to our bag of tricks. It's great to be here, and to recognize the scaling problems we have already conquered to get to this point.
Hi Camille,
This was a great post that touches on issues that many, many projects face. The slides you referenced are a clear roadmap that could probably be used by a lot of teams! I'm sure I'll be referring to them with my team.
I also work for a web-based retail business, and over here we are slowly coming around to the idea that it may actually be better to do our load testing in prod. Evidently, Etsy does a significant amount of their load testing in prod. If you take this approach, instead of spending a lot of money to try to get QA to equal production, you spend a lot of money instrumenting the heck out of prod. If your business is as successful as you hope, you could end up with something like Amazon, where it really is impossible to make QA equal prod. And even if you're not as big as Amazon, the firewalls, routers, load balancers, CDN, etc. may be more expensive to put into QA than the required amount of monitoring in prod.
I'd be happy to share more of my thinking on this if you're interested. And b.t.w., we've actually met in person at the hair dryer CTW. Congratulations on your new position!
Cheers, Rebecca