(Cross posted from here.)
When voting, people select among several options that have been predefined prior to the election. In more open scenarios of citizen participation, people are asked not only to vote for predefined options, but to contribute their own ideas before the final vote. This introduces another activity to the democratic process, we can call it filtering.
The aim of filtering is to select, out of a very large number of ideas, the very few best ones either to directly implement them or to put them to a formal vote. Filtering has to address the problem that voting cannot: How do we select out of potentially hundreds of ideas without having each voter evaluate them all? The solution provided by filtering is a common theme in many proposals to augment democracy: division of cognitive labour.
A single voter cannot rank hundreds of ideas, but the cognitive load of selecting the best ones can be spread across the many citizens that are contributing them. Filtering can scale because as the number of ideas grows with the number of people suggesting them, so too does the available cognitive effort available to evaluate them.
However, a key distinction must be noted between voting and filtering. The process of distributing cognitive labour allows filtering many ideas, but once we accept that not all people are evaluating everything, the information available is necessarily incomplete. In other words, the process of ranking ideas has an element of uncertainty, it is statistical.
Reddit-style filtering
When looking for tools to implement internet mediated filtering, sooner or later one runs into systems like reddit, slashdot, digg and the like. At an abstract level, we can pick out the following features which are relevant to the problem of filtering:
- Many users can submit a large amount of items as well as evaluate them
- Users evaluate items with a binary signal, eg voting up/down
- A ranking method exists that sorts items for some definition of better => worse.
At this level of description, these three properties fit filtering well. We have ideas that we wish to rank, and we cope with the large volume by having the users evaluate them, possibly using binary signal such as approval (or not approval). The meat of the matter lines in the third property, specifically in the phrase “for some definition of better => worse”.
Quality of ideas
Above we have used the abstract term “items”. In reddit, items can be stories or comments. The distinction is important because reddit uses different ranking algorithms for each. The key feature of filtering that departs from the reddit model is that reddit and similar sites are news aggregation tools. This is why stories in reddit include a novelty component in their ranking. Ceteris paribus, new items are better than old items.
Filtering on the other hand is not about news, but about ideas/proposals. Although novelty may still be a factor, it is not a central one. Idea quality must be mostly about voters evaluation through a binary signal, and not very much about its submission date. Filtering ranking is more like reddit’s ranking of comments than it is of stories.
The problem is simple then: we must rank ideas according to users’ binary evaluation in a way that deals with incomplete information. That is, we will not have a +/- judgement for every idea from every user, only subsets of this. Fortunately the problem as posed is directly an instance of binomial proportion estimation, an urn model.
We have, for every idea, an unknown quantity which represents the fraction of all users who would evaluate it positively. This quantity can be statistically inferred using the number voters that have in fact evaluated the idea and approved it. To do this we use the beta–binomial model, see here for an introduction. Briefly, we choose a uniform uninformative prior distribution with parameters alpha = beta = 1. Because the beta is conjugate to the binomial, posterior distribution is also a beta given by
The mean of a beta distrubution is given by
If we plug this into the posterior we see that the mean is simply approvals / total, which makes sense. So a first approximation would be to rank ideas by the mean of the (posterior) probability distribution, which is just the fraction of approvals out of total evaluations.
What about uncertainty?
The problem with the above is that it ignores how confident we are about the estimation of an idea’s quality. For example, an idea evaluated 1000 times with 500 approvals would rank equally to an idea evaluated twice and approved once. In technical terms, using the mean throws away information about the probability distribution. This is what the author of this post realized:
suppose item 1 has 2 positive ratings and 0 negative ratings. Suppose item 2 has 100 positive ratings and 1 negative rating. This algorithm puts item two (tons of positive ratings) below item one (very few positive ratings). WRONG.
The author then goes on to suggest a lower bound wilson confidence interval, this is how reddit currently ranks comments. Another option along these lines is to use a lower bound on our beta posterior using its cumulative distribution function (I wont go into more details here). In either case, the spirit of this measure is to rank ideas that we are confident are good higher than ideas which have not been evaluated enough.
What about equality/winner takes all/sunk ideas/information gain?
This is where filtering departs strongly from systems like reddit. Consider:
- Ideas should have an equal opportunity of being selected.
- Once ideas are ranked highly there is a snowball effect as they are more visible and can accumulate approvals.
- New good ideas will be lost and sunk as they cannot compete with the volume of older ones that have accumulated approvals.
- By penalizing uncertainty you concentrate evaluation away from where its most needed, information gain is minimized.
All these objections follow a common theme, the theme of balancing two competing objectives:
- We want to rank good ideas highly
- We want to allow for the discovery of good yet unproven ideas
Thompson Sampling and multi-armed bandits
The above dilemma is another form of the exploration-exploitation tradeoff that appears in reinforcement learning, an area of machine learning. There is one specific problem in machine learning whose structure is very similar to the problem of rating items and filtering: multi-armed bandits. In this post James Neufeld first makes the connection between rating comments on reddit and muti-armed bandits.
The comment scoring problem on reddit is slightly different from the basic setting described above. This is because we are not actually choosing a single comment to present to the user (pulling one arm) but, instead, producing a ranking of comments. There are some interesting research papers on modelling this problem precisely, for example [Radlinski, et al.,2008], but, it turns out to be a combinatorial optimization (hard). However, rather than going down this complex modelling path, one could simply rank all of the μ¯iμ¯i samples instead of taking just the max, this gives us a full ranking and, since the max is still at the top, is unlikely to adversely affect the convergence of the algorithm.
He then goes on to propose adapting Thompson sampling, a solution applied to multiarmed bandits in reinforcement learning, to the case of rating comments in reddit. The method of Thompson Sampling (or probability matching) is simple given what we’ve seen above regarding the beta posterior. Instead of using the beta posterior mean, or a lower bound on its cumulative distribution, we simply sample from it. The procedure is:
- For each idea, sample from its posterior beta probability distribution
- Construct a ranking according to the sampled value
Here’s how Thompson sampling relates to the two objetives stated above
- We want to rank good ideas highly
Sampling from a high quality posterior will tend to produce larger values.
- We want to allow for the discovery of good yet unproven ideas
Because we are sampling there is a chance that unproven ideas will be ranked highly. Furthermore this possibility is greater for better ideas and for ideas with high uncertainty. There is more to be investigated here about the relationship between Thompson sampling and maximizing information gain (eg Kullback Leibler divergence).
Extensions
The base Thompson sampling model can be extended in several ways. For example:
- Time: we can incorporate time factors by having decays on the beta distribution parameters.
- Collaborative filtering: we can apply weight factors on beta parameters depending on affinity with the user.
- Boosting newly submitted ideas: we can choose non uniform priors to ensure high exposure to new ideas.