
Wednesday, January 4, 2017

Hey Diddle Diddle, Data to Fiddle

When I worked in finance ages ago, there was a system used by many (but not me!) that was basically a combination of a gigantic distributed database plus a scripting language that allowed you to run calculations over information in that database. One of the things that you could easily do, as far as I understand, was "diddle" a piece of information. The "diddle" would change that piece of data inside a particular scope, so that you could quickly see different calculations over the graph, without necessarily persisting that data back to the larger system. This was a useful construct for exploring what might happen with changes to different input data, and for playing out different scenarios. (The first half of this blog post provides some insights into how the system might have worked).

Whether my understanding is exactly right or not is irrelevant except that this concept of "diddling" stuck with me. There are times when what you want to do is take persistent data, make a small change to it on the fly, and use the results of that change without necessarily persisting it back to the original data set.
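To make the idea concrete, here's a minimal sketch in Python of what a scoped, non-persistent override might look like. The data, the numbers, and the `diddle` helper are all made up for illustration; the real finance system surely worked differently.

```python
from contextlib import contextmanager

@contextmanager
def diddle(data, key, value):
    """Temporarily override one value in `data`, restoring it on exit.

    Nothing is written back to the underlying store; the change only
    exists inside the `with` block's scope.
    """
    original = data[key]
    data[key] = value
    try:
        yield data
    finally:
        data[key] = original

# Persistent inputs (imagine these came from the big shared database).
positions = {"AAPL": 100, "MSFT": 250}
prices = {"AAPL": 115.0, "MSFT": 62.0}

def portfolio_value(positions, prices):
    return sum(qty * prices[sym] for sym, qty in positions.items())

print(portfolio_value(positions, prices))      # value with the stored prices

# What if AAPL dropped to 90? Diddle it and recompute, without persisting.
with diddle(prices, "AAPL", 90.0):
    print(portfolio_value(positions, prices))  # value in the what-if scenario

print(prices["AAPL"])                          # back to 115.0 outside the scope
```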

I've often thought that this concept is particularly useful in places like personalization. Imagine a situation where you have a complex set of results that you wish to display to a user, like, say, the Google search results. We all know that the ranking system for Google results is a complex beast, relying on a huge amount of precomputed data, for example, the links between pages. But how do you compute that graph with personalization taken into account? You're probably not calculating all personalization vectors on the fly, but when I go to search for "java" you're also probably not doing a lookup of pre-computed personalized results for "java + userid:camille." Instead, you're applying a "diddle" function to the top set of the overall graph, and showing me the results in the diddled order that makes sense for me.
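A hand-wavy sketch of what that "diddle" over the top of the results might look like, assuming a list of precomputed, generally-ranked results and a made-up per-user affinity score (none of this reflects how Google actually does it):

```python
# A sketch of "diddling" the top of a precomputed ranking. Everything here
# is hypothetical: the base scores, the user's topic affinities, and the
# boost formula are stand-ins, not how any real search engine works.

def personalize_top(results, user_affinity, top_k=10, boost=0.2):
    """Re-rank only the first `top_k` precomputed results for one user.

    `results` is a list of (doc_id, topic, base_score) tuples already
    ordered by the general-purpose ranking. The tail of the list is left
    untouched, and nothing is written back to the index.
    """
    head, tail = results[:top_k], results[top_k:]
    diddled = sorted(
        head,
        key=lambda r: r[2] + boost * user_affinity.get(r[1], 0.0),
        reverse=True,
    )
    return diddled + tail

# Precomputed, generally-ranked results for the query "java".
results = [
    ("java-island-travel", "travel", 0.95),
    ("java-language-docs", "programming", 0.93),
    ("java-coffee-guide", "coffee", 0.90),
]

# This user reads a lot about programming and not much about travel.
camille = {"programming": 0.8, "travel": 0.1}

for doc_id, topic, score in personalize_top(results, camille, top_k=3):
    print(doc_id)
# java-language-docs comes out on top, even though the general ranking
# put the travel page first.
```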

There are two parts to the concept that make it powerful for me. The first is the idea that you are changing things temporarily. To serve large sets of results fast (or, in the case of Google, to be able to function at all), you need to pre-calculate a huge amount of data. You're doing a complex piece of work that takes some time, and you don't want to have to redo it for every request. However, you also don't want to force yourself to store all the work for all possible scenarios up-front. But diddling, in my mind, has a second element: it is only applied to the set of data within a limited scope. You don't diddle across the entire search index. You diddle the first few results, the ones that matter to the user in question.

There can be a ton of technical complexity to implementing such a concept in practice. One immediate challenge is "diddling" in such a way that results get dropped from the top set, which requires re-querying for additional results to get enough data to satisfy the user (sketched below). The purpose of this post is not to go into the technical details of how you might implement such a thing, but to show that you can reframe your thinking on a problem like this by breaking it into phases. Just because you have a list of thousands as your first pass of results doesn't mean you need to personalize across that whole data set to get the best results for the end user. If your goal is to get the most personally relevant of the most generally relevant, you probably want to operate on the top of the generally-ordered list, not necessarily the whole list itself.
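Here's one hypothetical way that re-querying might work, assuming the underlying ranking can be fetched in slices and that the per-user diddle is a simple drop rule; both `fetch_ranked` and `should_drop` are invented for this sketch:

```python
# A hypothetical sketch of the backfill problem: if the per-user "diddle"
# drops some of the top results (say, things the user has already seen),
# you have to go back to the underlying ranking for more candidates.

def personalized_page(fetch_ranked, should_drop, page_size=10):
    """Build one page of results, re-querying until the page is full.

    `fetch_ranked(offset, limit)` returns the next slice of the
    generally-ranked results; `should_drop(result)` is the per-user rule.
    """
    page, offset = [], 0
    while len(page) < page_size:
        batch = fetch_ranked(offset, page_size)
        if not batch:
            break  # ran out of underlying results
        page.extend(r for r in batch if not should_drop(r))
        offset += len(batch)
    return page[:page_size]

# Toy underlying ranking and a user who has already seen a few items.
ranked = [f"result-{i}" for i in range(100)]
seen = {"result-1", "result-3", "result-4"}

page = personalized_page(
    fetch_ranked=lambda offset, limit: ranked[offset:offset + limit],
    should_drop=lambda r: r in seen,
    page_size=10,
)
print(page)  # ten results, with the already-seen ones filtered out and backfilled
```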

There are many ways to attack such problems, and I have no idea how companies like Google solve the challenge of personalizing results under the hood. I do know that, to me, the idea of mixing indexed and computed results in personalized querying is a sticky one, and it's an analogy I use frequently. It helps you remember that there is value in the underlying order of the results as provided by the source of truth, and that personalization is often an enhancement on top of an underlying set of computed data, not the fundamental computation itself. Pulling back to the larger picture, remember that when your data gets in front of a human end user, they are going to operate on only the tiny surface area they, as a human, can process at any one time. So in cases where you're serving humans, you can apply different patterns on the fly to the human-visible surface area that would be too expensive to apply to the entire data set at large.
