We left off the discussion with:
We want to calculate the proportion of fake ballots in an election based on the results of limited audits. We have seen how the binomial and hypergeometric distributions give probabilities for the results of an audit under an assumption about the proportion of fake ballots. Bayes' theorem can then be used to calculate the inverse probability that we are after, once we have specified a prior.
Bayesian inference is a process that takes prior information and adds evidence to obtain a posterior distribution. In our case this posterior distribution will be over the possible proportions of fake ballots in the set of all ballots. Let's begin with the binomial case. What prior should we use? One answer is that, since we know nothing about the proportion of fake ballots, we should be indifferent about each possibility. This translates into a uniform prior, where all proportions are equally likely. For example
P(proportion = fake ballots / total ballots) = 1 / (total ballots + 1)
Since there are n + 1 possibilities for the number of fake ballots, we give each of them the same weight, which is 1 / (n + 1).
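As a tiny sketch in JavaScript, this indifference is just an array of n + 1 equal weights (the helper name `uniformPrior` is made up for illustration; it is not part of any library):

```javascript
// Uniform prior over the n + 1 possible counts of fake ballots (0 .. n):
// each possibility gets the same probability, 1 / (n + 1).
function uniformPrior(n) {
  return new Array(n + 1).fill(1 / (n + 1));
}
```

The weights sum to 1, as any probability distribution must.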
Beta + Binomial = Beta-Binomial
Before plugging this into Bayes' theorem, a small technical detour. Notice how the prior is itself a probability distribution, defined over the 0.0 – 1.0 interval: the minimum proportion (0.0) means no fake ballots and the maximum (1.0) means all ballots are fake. It turns out there is a parametric probability distribution one can use for this interval, called the Beta distribution. The Beta distribution has two parameters, alpha and beta. The neutral prior we defined above is equivalent to the Beta distribution with parameters (1, 1)
P(proportion) = 1 / (n + 1) = Beta(1, 1)
We could express other knowledge with different choices of alpha and beta. But what's the point of using the Beta, besides having a convenient way to specify priors? The point is that the Beta distribution is a conjugate prior of the binomial distribution. This means that the posterior distribution, once the available evidence has been taken into account, is also a Beta distribution. Calculating the posterior is therefore much easier: inference is just a matter of mapping the Beta's parameters to new values. Here is the posterior of the Beta distribution when it is used as the prior of the binomial (this is called the beta-binomial model).
P(p | k, n) ∝ P(k | n, p) P(p)
∝ p^k (1 – p)^(n – k) × p^(alpha – 1) (1 – p)^(beta – 1)
= p^(alpha + k – 1) (1 – p)^(beta + n – k – 1)
∝ Beta(alpha + k, beta + n – k)

Equations adapted from [1]. The first line is just Bayes' theorem, but the payoff is that the last line corresponds to a Beta distribution with different parameters. In summary,
with a Beta prior, Bayesian inference reduces to remapping the initial parameters alpha and beta to alpha + k and beta + n – k, where k is the number of successes and n is the number of trials. Conjugate priors are an algebraic convenience that make it easy to obtain analytic expressions for posteriors. End of detour; please refer to [1] for further details.
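The update rule is short enough to sketch in a few lines of JavaScript (the helper name `betaPosterior` is hypothetical, chosen here for illustration):

```javascript
// Conjugate update for the beta-binomial model: a Beta(alpha, beta) prior,
// combined with an audit of n ballots in which k were fake, yields a
// Beta(alpha + k, beta + n - k) posterior.
function betaPosterior(alpha, beta, k, n) {
  return { alpha: alpha + k, beta: beta + n - k };
}
```

For example, a uniform Beta(1, 1) prior together with an audit finding 3 fakes out of 10 gives `betaPosterior(1, 1, 3, 10)`, i.e. Beta(4, 8).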
Armed with the beta-binomial model, obtaining the posterior given some audit results is simple. If we audited 10 ballots and 3 of them were fake, our posterior would simply be
P(proportion = p | fake audit count = 3 out of 10)
= Beta(1 + 3, 1 + 10 – 3)
= Beta(4, 8)
Here's what Beta(4, 8) looks like
Note how the peak of the distribution is at 0.3; this makes sense, since 3 out of 10 ballots in the sample were fake. Evidence has transformed our initial uniform prior into the distribution seen above. This meets our original objective: a way to judge how many ballots are fake in the entire set based on limited audits. But it's not the end of the story. We would also like an estimate of whether or not the election result is correct. As we said previously, this estimate can be used either as a guarantee that all went well or, in the opposite case, to detect a problem and even invalidate the results.
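The location of the peak can be checked directly: for alpha, beta > 1, the mode of a Beta(alpha, beta) distribution is (alpha – 1) / (alpha + beta – 2). A small sketch:

```javascript
// Mode (peak) of a Beta(alpha, beta) distribution, valid for alpha, beta > 1.
function betaMode(alpha, beta) {
  return (alpha - 1) / (alpha + beta - 2);
}
// betaMode(4, 8) = (4 - 1) / (4 + 8 - 2) = 0.3
```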
The probability that an election result was correct, given uncertainty about fake ballots, depends on two things. One is the proportion of ballots that are fake; this is what we already have a calculation for. The other is the actual election outcome, specifically a measure of how close the result was. The reason is simple: if the election was close, a small number of fake ballots could cast doubt on its correctness. Conversely, if the result was a landslide, the presence of a few fake votes has no bearing on its correctness. For our purposes we will stick with a simple example in which the election decides between two options by simple plurality.
Call the difference between the winning and losing options d
d = winner votes – loser votes
In order for the election result to be wrong, there must be at least d fake votes. The existence of d fake votes does not imply that the result was wrong, but it is a necessary condition. Thus the probability that the number of fake votes is greater than or equal to d is an upper bound on the probability that the election was incorrect. Call this E (for error)
P(proportion of fake votes >= d / total votes) = E
(upper limit on the probability that the election was wrong)
We have P(proportion); it is the posterior we obtained above. How do we get P(proportion >= some constant)? Through the Beta distribution's cumulative distribution function, which is defined in general as

CDF(x) = P(proportion <= x) = I_x(alpha, beta)

where I_x is the regularized incomplete beta function.
To reverse the inequality, we just subtract the CDF from 1 (this gives us the tail distribution). We finally have
Probability of incorrect result
= P(proportion >= d / total ballots)
= 1 – CDF(d / total ballots)
One final correction. Because we have already audited a number of ballots with known results, we must apply our calculation to the remaining, unaudited ballots: the fake ballots found in the sample come off the required d, and the sample size comes off the total.
P(E) = 1 – CDF((d – fake ballots found in sample) / (total ballots – sampled ballots))
Let’s try an example, an election between option A and option B with the following numbers.
Votes for A = 550
Votes for B = 450
Total ballots = 1000
Audited ballots = 100
Audited fake ballots = 4
which gives
Posterior = Beta(5, 97)
d = 100
Minimum fraction of fake votes required to change the result = (100 – 4) / (1000 – 100) = 0.1067
Upper bound on probability of error
= 1 – CDF(Beta(5, 97), 0.1067)
= 0.0135
In conclusion, the probability of error due to fake ballots in this election is less than or equal to 1.35%.
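The whole calculation can be reproduced end to end in plain JavaScript. The sketch below is self-contained and does not use the jStat library: since our Beta parameters are integers, it computes the CDF via the identity I_x(a, b) = P(Binomial(a + b – 1, x) >= a) instead of the general incomplete beta function.

```javascript
// Binomial coefficient C(n, k), computed iteratively to keep values manageable.
function binomialCoeff(n, k) {
  let c = 1;
  for (let i = 0; i < k; i++) c = (c * (n - i)) / (i + 1);
  return c;
}

// CDF of Beta(a, b) at x, for integer a and b, using the identity
// I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
function betaCdf(x, a, b) {
  const n = a + b - 1;
  let sum = 0;
  for (let j = a; j <= n; j++) {
    sum += binomialCoeff(n, j) * Math.pow(x, j) * Math.pow(1 - x, n - j);
  }
  return sum;
}

// Election and audit data from the example above.
const votesA = 550, votesB = 450;
const totalBallots = 1000;
const auditedBallots = 100;
const auditedFake = 4;

// Posterior from a uniform Beta(1, 1) prior: Beta(5, 97).
const alpha = 1 + auditedFake;
const beta = 1 + auditedBallots - auditedFake;

// Fraction of the unaudited ballots that would have to be fake
// for the result to change: (d - 4) / (1000 - 100).
const d = votesA - votesB;
const threshold = (d - auditedFake) / (totalBallots - auditedBallots);

// Upper bound on the probability of an incorrect result.
const errorBound = 1 - betaCdf(threshold, alpha, beta);
console.log(errorBound.toFixed(4)); // prints "0.0135"
```

Note that `betaCdf(x, 1, 1)` reduces to x, matching the uniform prior, which is a handy sanity check for the tail-sum identity.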
You can find a JavaScript implementation of everything we've seen so far in this jsfiddle. Calculations for the binomial, beta, hypergeometric and cumulative distribution functions are done with the jStat JavaScript library. Wait, you say, what about the hypergeometric? We'll leave that for the next post, which should be pretty short.
[1] http://www.cs.cmu.edu/~10701/lecture/technote2_betabinomial.pdf