Modeling subscription churn

Note: this is a work in progress please send feedback to @statwonk

Subscription churn usually clusters around the first day and each subsequent bill. “Buyers remorse” is a common phrase for early churn and it’s no surprise that subscribers churn when presented with a bill. It’s important to model these clusters well because they’re where most customers churn.

Let’s generate some toy data reflecting this common business scenario.

library(eha)

example_hazard_rates <- c(
  0.06, # buyer's remorse
  0.003,
  0.055, # second bill
  0.002,
  0.05, # third bill
  0.002,
  0.04, # fourth bill
  0.004)

# It seems that the eha::rpch function has a little bug
# with failing to pass the cuts, levels params. This is
# a quick fix.

# First generate piecewise-constant survival data
rpch <- function(n, cuts, levels) {
  x <- runif(n)
  qpch(x, cuts, levels)
}

x <- data.frame(
  times = rpch(
    n = 400,
    cuts = c(1, # initial purchase
             31, 34, # second bill
             62, 65, # third bill
             93, 96 # fourth bill
    ),
    levels = example_hazard_rates
  ),
  censor = 1
)

plot(
  survfit(Surv(times, censor) ~ 1, data = x),
  xlim = c(0, 100),
  ylim = c(0.5, 1),
  conf.int = TRUE,
  main = "Subscription lifetime",
  ylab = "% still paying",
  xlab = "Days since subscribing"
)

Alright, now we have our toy dataset.

Typical parametric models will fail in the face of buyer’s remorse and billing churn, so we need something new.

library(flexsurv) # <-- awesome library; Christopher Jackson, you rock!
fitg <- flexsurvreg(
  Surv(times, censor) ~ 1,
  data = x, dist = "gengamma")
plot(fitg, xlim = c(0, 100), ylim = c(0.5, 1),
     main = "Subscription lifetime",
     ylab = "% still paying",
     xlab = "Days since subscribing")

pfit <- phreg(
  Surv(times, censor) ~ 1, data = x,
  dist = "pch",
  cuts = c(1, # initial purchase
           31, 34, # second bill
           62, 65, # third bill
           93, 96 # fourth bill
  )
)
survival_estimates <- function(x, xlim = c(0, 100)) {
  if(x$n.strata > 1) stop("Developed for a single strata.")
  ncov <- length(x$means)
  npts <- 4999
  xx <- seq(xlim[[1]], xlim[[2]], length = npts)
  ppch(xx, x$cuts, x$hazards[1, ], lower.tail = FALSE)
}

estimates <- survival_estimates(pfit)
plot(fitg, xlim = c(0, 100), ylim = c(0.5, 1),
     main = "Subscription lifetime",
     ylab = "% still paying",
     xlab = "Days since subscribing")
lines(
  x = seq(0, 100, length = length(estimates)),
  y = estimates,
  col = "blue")

estimated_hazard_rates <- piecewise(
  enter = 0, exit = x$times, event = x$censor,
  cutpoints = c(1, # initial purchase
                31, 34, # second bill
                62, 65, # third bill
                93, 96 # fourth bill
  ))$intensity

… compare to our original example hazard rates,

data.frame(
  actual = round(example_hazard_rates, 3),
  estimates = round(example_hazard_rates, 3)
)

##   actual estimates
## 1  0.060     0.060
## 2  0.003     0.003
## 3  0.055     0.055
## 4  0.002     0.002
## 5  0.050     0.050
## 6  0.002     0.002
## 7  0.040     0.040
## 8  0.004     0.004

Perfect! We don’t need statistics! :-p Since we’re specificing the random data upfront, we’re able to choose perfect cut points from a known distribution. It’s no suprise the model fits very well. In reality, we need to take this a step further by bootstrapping the estimates for a measure of uncertainty similar to our plotted parametric model. This is what we’ll want to deploy on our actual churn data.

piecewise_fit <- function(d, i) {
  survival_estimates(
    phreg(
      Surv(times, censor) ~ 1, data = d[i, ],
      dist = "pch",
      cuts = c(1, # initial purchase
               31, 34, # second bill
               62, 65, # third bill
               93, 96 # fourth bill
      )
    )
  )
}
library(boot)
library(dplyr)
library(reshape2)
pch_estimates <- t(boot(x, piecewise_fit, R = 100)$t) %>%
  as.data.frame() %>%
  tbl_df() %>%
  mutate(index = 1:n()) %>%
  melt("index") %>%
  tbl_df %>%
  group_by(index) %>%
  summarise(lower = quantile(value, 0.05),
            estimate = mean(value),
            upper = quantile(value, 0.95))

plot(fitg, xlim = c(0, 100), ylim = c(0.5, 1), conf.int = F,
     main = "Subscription lifetime",
     ylab = "% still paying",
     xlab = "Days since subscribing")
dashed_lines <- function(y) {
  lines(
    x = seq(0, 100, length = nrow(pch_estimates)),
    y = y,
    col = "blue",
    lty = 2)
}
lines(
  x = seq(0, 100, length = nrow(pch_estimates)),
  y = pch_estimates$estimate,
  col = "blue",
  lty = 1)
dashed_lines(pch_estimates$lower)
dashed_lines(pch_estimates$upper)

Very cool. Now we’re on our way to accurately measuring churn. The piecewise-constant model assumes a steady exponential rate between each interval. It is a parametric model. This isn’t a terrible assumption with real subscription data, though and it gives the problem tractibility. If you want to deploy an experiment that targets days 31 - 34 (when customers receive the first bill), using a parametric model will yield statistical power for a faster turnaround. Also, we can see that the generalized gamma distribution, a distribution that includes many other distributions, can’t cope with the quick drops caused by buyer’s remorse and bills. The piecewise-constant model yields a great fit for this type of data and can enable high-precision experiments.

@statwonk

Modeling subscription churn

Statwonk

May 22, 2016