Bernoulli Experiments & Maximum Likelihood with Stan

Christopher Peters · Bayesian Methods

I'm a big fan of expressing math as code. While it's not as terse, I find it makes statistics much easier to learn. In this post I'll break down a small set (just three!) of Bernoulli experiments. Binary data is really common in business, so it pays to work through the details.

At the end I'll use Stan to generate a probabilistic estimate of p̂. Markov chain Monte Carlo is overkill for this type of problem, but when learning I find it's best to start with simple things.

Generating Bernoulli Experiments

First, I'll generate three Bernoulli experiments with pretend "true" parameter values:

library(rstan)
library(dplyr)

dbernoulli <- function(x, p) dbinom(x, 1, p)
rbernoulli <- function(n, p) rbinom(n, 1, p)
set.seed(1)
N <- 3L
(p <- runif(1))
## [1] 0.2655087

(coin_flips <- rbernoulli(N, p))
## [1] 0 0 1
Observed coin flip proportions

The Likelihood Function

The most crucial part of doing high-quality statistics is knowing how to construct and evaluate a function that is, at worst, proportional to the probability of the observed data. This is the likelihood function.

The Bernoulli probability mass function is: f(x; p) = p^x × (1 − p)^(1−x), where x can only take the values 0 or 1.
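As a quick sanity check, the mass function returns p when x = 1 and 1 − p when x = 0. A minimal sketch, restating the dbernoulli helper from above so it runs standalone:

```r
# Bernoulli PMF as a thin wrapper over the binomial with size = 1
dbernoulli <- function(x, p) dbinom(x, 1, p)

dbernoulli(1, 0.3)  # p itself: 0.3
dbernoulli(0, 0.3)  # 1 - p:    0.7
```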

We evaluate the density at each data point and multiply the terms:

bernoulli_likelihood <- function(p_to_try) {
  # Evaluate the density at each observation, then multiply the terms
  prod(dbernoulli(coin_flips, p_to_try))
}

Manual Bisection Search

bernoulli_likelihood(p_to_try = 0.0001)
## [1] 9.998e-05
bernoulli_likelihood(p_to_try = 0.5)
## [1] 0.125
bernoulli_likelihood(p_to_try = 0.25)
## [1] 0.140625
bernoulli_likelihood(p_to_try = 0.375)
## [1] 0.1464844
bernoulli_likelihood(p_to_try = 0.33)
## [1] 0.1481469
bernoulli_likelihood(p_to_try = 0.3375)
## [1] 0.1481934

We estimated p̂ as ~0.33, not exactly the true value of about 0.27, but not bad for only three data points! One can always complain that their sample size isn't large enough, but a smarter alternative is to become really good at squeezing the most information out of the least data.
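Bisection by hand isn't the only route. For Bernoulli data the maximum likelihood estimate has a closed form, the sample mean, and a one-dimensional numeric optimizer recovers the same answer. A minimal sketch, restating the data and likelihood from above so it runs standalone:

```r
coin_flips <- c(0, 0, 1)  # the observed flips from above

# Likelihood of a candidate p given the observed flips
bernoulli_likelihood <- function(p_to_try) {
  prod(dbinom(coin_flips, 1, p_to_try))
}

# Closed-form MLE for Bernoulli data: the sample mean
mean(coin_flips)  # 1/3, about 0.333

# Numeric maximization over [0, 1] agrees with the bisection search
optimize(bernoulli_likelihood, interval = c(0, 1), maximum = TRUE)$maximum
```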

Stan MCMC

Now let's use this as an opportunity to learn some Stan:

model_code <- "
data {
  int<lower=0> N;
  int<lower=0, upper=1> y[N];
}
parameters {
  real<lower=0, upper=1> p;  // bounds give p an implicit uniform prior
}
model {
  y ~ bernoulli(p);
}"

fit <- stan(model_code = model_code,
            data = list(N = N, y = coin_flips))

extract(fit, pars = "p") %>% unlist() %>% density() %>%
  plot(main = "Posterior density of p")
Posterior density of p from Stan MCMC
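Because the uniform prior on p is conjugate to the Bernoulli likelihood, this posterior is also available in closed form: Beta(1 + Σy, 1 + N − Σy), which for these data is Beta(2, 3). A short sketch (assuming that flat prior) of the exact answer the MCMC density should approximate:

```r
coin_flips <- c(0, 0, 1)
a <- 1 + sum(coin_flips)                        # prior 1 + successes = 2
b <- 1 + length(coin_flips) - sum(coin_flips)   # prior 1 + failures  = 3

a / (a + b)  # exact posterior mean: 0.4

# Exact posterior density to compare against the Stan plot
curve(dbeta(x, a, b), from = 0, to = 1,
      main = "Exact Beta(2, 3) posterior of p")
```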
Key Takeaway: From just three Bernoulli observations (0, 0, 1), both manual bisection and Stan's MCMC converge near the true parameter value of 0.27. No sample is too small to extract information from; the key is constructing a proper likelihood function.