During the past few months, my co-workers and I have been thinking about the data we have and how we can use it to answer some interesting questions for visitors to our site. Some great suggestions and starting points came from Programming Collective Intelligence, which is one of the most interesting programming books I’ve read all year. From there we moved on to some simple data clustering, which gave us some basic data to use in a few experiments we wanted to try.
For one of these experiments, we took some “Yes” and “No” responses to a question we ask when performing an action on the site, and tried to predict the probability of a “Yes” answer for other items that had not yet received any responses. It seemed easy at first, but I was amazed at how much math and statistics I had forgotten only a few years out of school!
After a few dead-ends, we discovered that the problem had two parts:
- For each known object in a cluster, calculate a probability, based on the number of “Yes” responses it received, that the answer to the question we asked was actually “Yes”.
- For the unknown object, grab all the known objects in its cluster, and calculate an average of their probabilities, weighted by the inverse of their distances.
Eventually, I stumbled upon two pages that described the steps we needed to take to do this prediction right.
The first showed up in a conversation about Reddit’s upgrades to its comment ranking system. This page is a quick overview of the idea behind the formula, and this page describes the formula being used. It also called out our first two attempts at solving this problem as stupid — no argument here.
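Reddit’s comment ranking is commonly described as using the lower bound of the Wilson score interval. Assuming that is the formula meant here (the pages are not named above, so this is an assumption), a minimal sketch of the first step might look like this. The method name `wilson_lower_bound` and the default `z = 1.96` (roughly 95% confidence) are illustrative choices, not something taken from our actual code.

```ruby
# Lower bound of the Wilson score interval for a "Yes" proportion.
# Assumption: this is the formula the linked pages describe.
#   yes   - number of "Yes" responses for the object
#   total - number of "Yes" plus "No" responses
#   z     - normal quantile for the confidence level (1.96 is about 95%)
def wilson_lower_bound(yes, total, z = 1.96)
  # With no responses there is nothing to estimate from.
  return 0.0 if total.zero?

  p_hat = yes.to_f / total
  z2 = z * z
  centre = p_hat + z2 / (2.0 * total)
  margin = z * Math.sqrt((p_hat * (1.0 - p_hat) + z2 / (4.0 * total)) / total)
  (centre - margin) / (1.0 + z2 / total)
end
```

With a single “Yes” out of one response this gives about 0.21, while 100 “Yes” out of 100 gives about 0.96, so objects with only a handful of responses are scored cautiously.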
The second was a little trickier to find, but once we had gone down enough wrong paths for me to finally get the search terms right, a Google search turned up a Wikipedia page that solved the problem. Here’s the Ruby code I whipped up based on the information on that page:
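As a minimal sketch of that second step, assuming the technique is plain inverse distance weighting (the Wikipedia page is not named above, so this is an assumption), the weighted average might look like this. The names `idw_estimate`, `neighbors`, and `power` are illustrative.

```ruby
# Inverse-distance-weighted average of known probabilities.
# Assumption: each neighbor is a [distance, probability] pair for a known
# object in the same cluster as the unknown object.
#   power - exponent applied to the distance (1.0 gives plain 1/distance)
def idw_estimate(neighbors, power = 1.0)
  return nil if neighbors.empty?

  # A known object at distance zero coincides with the unknown one,
  # so use its probability directly and avoid dividing by zero.
  exact = neighbors.find { |distance, _| distance.zero? }
  return exact[1].to_f if exact

  weighted_sum = 0.0
  weight_total = 0.0
  neighbors.each do |distance, probability|
    weight = 1.0 / (distance**power)
    weighted_sum += weight * probability
    weight_total += weight
  end
  weighted_sum / weight_total
end
```

For example, `idw_estimate([[1.0, 0.2], [3.0, 0.6]])` leans toward the closer object’s probability, because its weight is three times larger.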
This gave us some very reasonable-looking data, and will hopefully lead to some really cool features!