Time-to-event data are extremely common in industry. Commonly one has timestamps from a log file and wants to gain some understanding of a conversion rate. We can borrow from the manufacturing industry and deploy Weibull analysis.
Weibull analysis can get us a quick measurement of a special statistic called the "characteristic life" — the 0.632 quantile (63.2 percentile) of a conversion rate under the Weibull model. Estimation of characteristic life over many different conversion rates yields a way to compare a basket of conversion rate scales.
The Right-Censoring Challenge
These data are right-censored (Type 1 censored). We collect data from the start of some process and often we observe conversion, but sometimes we do not. The remaining unconverted units could all convert the moment after we pull the data, or they might never convert.
Setting Up the Simulation
suppressPackageStartupMessages({
library(dplyr)
library(purrr)
library(survival)
library(ggplot2)
library(glue)
library(ggthemes)
})
set.seed(0)
N <- 1e3 # observations to generate
shape <- 0.3 # quick stimulus, fast decay
quantile_632 <- 5e5 # ~350 days to 63.2% converted
(x <- rweibull(N, shape = shape, scale = quantile_632)) %>%
tibble(x = .) %>%
ggplot(aes(x = x)) +
geom_histogram(colour = "black") +
ggtitle(glue("Histogram of X ~ Weibull(k={shape}, λ={quantile_632})")) +
xlab("Minutes")
A Censored Weibull Generator
There's no built-in R function for generating censored Weibull data. The idea: draw from two generating processes. If the censoring process returns a time prior to that of the death process, the observation is censored; otherwise we've observed death (conversion).
rweibull_cens <- function(n, shape, scale) {
a_random_death_time <- rweibull(n, shape = shape, scale = scale)
a_random_censor_time <- rweibull(n, shape = shape, scale = scale)
observed_time <- pmin(a_random_censor_time, a_random_death_time)
censor <- observed_time == a_random_death_time
tibble(time = observed_time, censor = censor)
}
Validating the Generator
Always check a random number generator by feeding it known parameters and recovering them:
rweibull_cens(1e6, 0.3, 5e5) %>%
flexsurv::flexsurvreg(Surv(time, censor) ~ 1, data = ., dist = "weibull")
## Estimates:
## est L95% U95% se
## shape 3.00e-01 2.99e-01 3.01e-01 3.31e-04
## scale 4.99e+05 4.94e+05 5.03e+05 2.40e+03
##
## N = 1000000, Events: 500928, Censored: 499072
Confidence Interval Coverage
For validation, I generate 100 samples and check that confidence intervals mostly contain the known parameters, coloring intervals that miss:
seq_len(100) %>%
map(~ rweibull_cens(N, shape = shape, scale = quantile_632)) %>%
map(~ { .$sample <- uuid::UUIDgenerate(); . }) %>%
map(function(x) {
survival::survreg(Surv(time, censor) ~ 1, data = x,
dist = "weibull",
control = survreg.control(maxiter = 100)) %>%
broom::tidy() %>%
mutate(sample = unique(x$sample))
}) %>%
bind_rows() %>%
filter(term == "(Intercept)") %>%
mutate(lower = exp(conf.low), upper = exp(conf.high)) %>%
mutate(actual = quantile_632) %>%
mutate(miss = ifelse(
actual < lower | upper < actual,
"Failed to include real parameter",
"Included parameter")) %>%
ggplot(aes(y = factor(sample))) +
geom_errorbarh(aes(xmin = lower, xmax = upper,
colour = factor(miss)), height = 0) +
geom_vline(aes(xintercept = unique(actual)), col = "red") +
scale_color_discrete(name = "") +
theme_bw(18) +
theme(axis.text.y = element_blank(),
legend.position = "top")