As you may guess from the title of this post, the release was not entirely smooth for our infrastructure team. The functionality got out successfully, but two hours after we released we started noticing slowness on the pages, and a quick audit showed frequent full GCs on the services. Some rogue caching was being exercised much more than we had seen during load testing. After some scrambling, we resized the machines and restarted the VMs with more memory. Fortunately the cache would only get so big, and we could quickly throw more memory onto the machines (thank the cloud!). Crisis averted, we set to fixing the caching so that we wouldn't hit slow FGCs.
The fix seemed fairly straightforward; take the cache, which was originally caching parameters mapped to objects, and instead just cache the object primary ids. So the project lead coded up the fix, and we pushed it out.
Here's the fix. Notice anything wrong? I didn't. We're big fans of Guava and use List transformers all over the place in our code base. So we load test that again, and it looks ok for what our load tests are minimally worth, so we push it onto one of our prod boxes and give it a spin.
At first, it seemed just fine. It hummed along, seeming to take less memory, but slowly but surely the heap grew and grew, and garbage collected more and more. We took it out of the load balancer, forced a full GC, and it still had over 600m of active heap memory. What was going on?
I finally took a heap dump and put the damned thing into MAT. Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise, we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element, and saw something strange. Instead of the cache element containing a string key and a list of strings as the value, it contained some other object. And inside that object was another list, and a reference to something called "function", that pointed to our base class.
At first, it seemed just fine. It hummed along, seeming to take less memory, but slowly but surely the heap grew and grew, and garbage collected more and more. We took it out of the load balancer, forced a full GC, and it still had over 600m of active heap memory. What was going on?
I finally took a heap dump and put the damned thing into MAT. Squinting at it sideways showed me that the memory was being held by Ehcache. No big surprise, we knew we were caching things. But why, then, was it so big? After digging into the references via one of the worst user interfaces known to man, I finally got to the bottom of an element, and saw something strange. Instead of the cache element containing a string key and a list of strings as the value, it contained some other object. And inside that object was another list, and a reference to something called "function", that pointed to our base class.
As it turns out, Lists.transform is a lazy function. Instead of applying the transformer to the list immediately and returning the results, you get back an object that acts like a list but only applies the transform on the objects as you retrieve them the first time. Which is great for saving a bit of time up front, but absolutely terrible if you're caching the result to save yourself memory. Now, to be fair, Guava tells you that this is lazy in the javadoc:
The best part of this exercise was teaching other developers and our ops folks about the JVM monitoring tools I've mentioned before; without jstat -gc and jmap I would have been hard-pressed to diagnose and fix this problem as quickly as I did. Now at least one other member of my team understands some of the fundamentals of the garbage collector, and we've learned a hard lesson about Guava and caching that we won't soon forget.
Great post! Congratulations for solving this tough bug. I am still somewhat of a clueless when it comes to monitoring. Even more motivated to learn after reading your post! :-)
ReplyDeleteNevertheless, I think one great lesson you could add under your belt is: know your APIs *very* well. Even more if it's as frequently used as Lists.transform(). Fortunately, Guava has one of the best, if not the best, javadocs I've ever seen in a OSS. Guava devs have shown more than once that they spend much time on making their javadocs as comprehensive and useful as possible. While many other open source projects just say to go f#%ck ourselves and figure it out from sources. :(
When I am really (*really*) bored I read java.util and Guava javadocs. If I remember well, Joshua Bloch's Effective Java recommends this practice.
Finally, I agree with you that this lazy behaviour should be highlighted on bold font on Guava javadocs (open a ticket?). As a Guava user myself I find Guava devs quite fond of lazy data structures. My bet is the huge volume of data their apps usually need to churn (e.g., retrieving thousands of e-mails in GMail on demand).
PS: Why not use ImmutableList.copyOf() instead of ArrayList? Or ImmutableSet? It doesn't look like you are going to update this List...
Haha thanks for the code review via blog comment! I didn't even think about using an ImmutableList, that's a good call. I should probably spend a bit more time reading the Guava javadoc...
DeleteHave you lost your love for yourkit? That's still my default tool for memory problems.
ReplyDeleteHehe no I just didn't feel like getting a temporary license, etc etc etc. But I swore the next time I had to do this I was going to make JTB buy me a YourKit license.
DeleteThis comment has been removed by the author.
DeleteI'm not sure this was the best case for the use of a transformation--especially since it is neither reused nor composed with other transform functions. It feels like the use of tranform in this case made the code buggy and less readable.
ReplyDeleteAnother approach would be something like:
ImmutableList.Builder builder = ImmutableList.newBuilder();
for (Item item : Items) {
builder.add(item.getStyleName);
}
List styleNames = builder.build();
It's fewer lines of code, the intent is explicit, and it creates fewer objects. I'm willing to bet that it's faster, although not algorithmicaly so.
It looks like blogger stripped out the generics in the above snippet, but you get the idea. :-)
Delete