In a previous post, we showed that Bayesian hierarchical models (BHMs) win when group structure is describable, Bayesian neural networks (BNNs) win when the signal is complex, and their ensemble hedges across both scenarios.
But what about the case where your data has both — known group structure and complex nonlinear signal? Neither the BHM nor the BNN is ideal:
- The BHM captures group heterogeneity but can't represent the nonlinear signal
- The BNN learns the nonlinear function but ignores group structure
Enter the hierarchical Bayesian neural network (HBNN) — a model that combines partial pooling over groups with a flexible neural network function class. This post adds the HBNN to our comparison and shows when each of the four approaches excels.
The Four Models
| Model | Group Structure | Nonlinear Flexibility |
|---|---|---|
| BHM | ✅ Group-varying intercept + slope | ❌ Linear in x |
| BNN | ❌ No group information | ✅ tanh hidden layer |
| HBNN | ✅ Group-varying intercepts | ✅ tanh hidden layer |
| Ensemble | Partial (via BHM component) | Partial (via BNN component) |
The Hierarchical BNN in Stan
The HBNN adds partially-pooled group intercepts to the neural network. The non-centered parameterization lets Stan explore the posterior efficiently:
```stan
parameters {
  real alpha_0;               // population intercept
  real<lower=0> sigma_alpha;  // group intercept spread
  vector[J] alpha_raw;        // non-centered group deviations
  vector[H] w1;               // input→hidden weights
  vector[H] b1;               // hidden biases
  vector[H] w2;               // hidden→output weights
  real b2;                    // output bias
  real<lower=0> sigma;        // observation noise
}
transformed parameters {
  vector[J] alpha = alpha_0 + sigma_alpha * alpha_raw;
}
model {
  vector[N] mu;
  ...  // priors on alpha_0, sigma_alpha, alpha_raw, weights, and sigma
  for (n in 1:N) {
    vector[H] h = tanh(w1 * x[n] + b1);
    mu[n] = alpha[g[n]] + b2 + dot_product(w2, h);
  }
  y ~ normal(mu, sigma);
}
```
Line by line, this is a BNN with one addition: alpha[g[n]] — a group-specific
intercept drawn from a learned population distribution. The neural network learns the shared
nonlinear signal f(x), and the hierarchical intercepts capture group-level shifts.
Each component does what it's best at.
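The forward pass is easy to sanity-check outside Stan. Below is a minimal NumPy sketch of the HBNN mean function: the parameter values are arbitrary stand-ins for posterior draws, and the 0-indexed group labels are a convention of the sketch, not of the Stan code.

```python
import numpy as np

rng = np.random.default_rng(1)
J, H, N = 10, 8, 50                    # groups, hidden units, observations

# arbitrary stand-ins for one posterior draw of each parameter
alpha = rng.normal(0.0, 1.5, size=J)   # group intercepts (alpha_0 + sigma_alpha * alpha_raw)
w1, b1 = rng.normal(size=H), rng.normal(size=H)
w2, b2 = rng.normal(size=H), 0.3

x = rng.uniform(-3, 3, size=N)
g = rng.integers(0, J, size=N)         # 0-indexed group labels

# mirrors: mu[n] = alpha[g[n]] + b2 + dot_product(w2, tanh(w1 * x[n] + b1))
h = np.tanh(np.outer(x, w1) + b1)      # N x H hidden activations
mu = alpha[g] + b2 + h @ w2            # N-vector of conditional means

print(mu.shape)  # (50,)
```

One conditional mean per observation, with the group intercept added outside the nonlinearity, exactly as in the Stan model block.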
Three Data-Generating Processes
DGP 1: Grouped Linear — BHM Excels
```r
# Group-specific intercepts and slopes, linear signal
alpha_true <- rnorm(J, 0, 1.5)
beta_true  <- rnorm(J, 2, 0.8)
d1 <- tibble(g = rep(1:J, each = n_per), x = runif(N, -2, 2)) %>%
  mutate(mu = alpha_true[g] + beta_true[g] * x,
         y = rnorm(n(), mu, 0.5))
```
The signal is linear and the heterogeneity is in group-specific intercepts and slopes. The BHM is perfectly specified — its functional form matches the DGP exactly. The HBNN's neural network component adds unnecessary flexibility here, making it a close second but slightly less efficient due to the additional parameters.
DGP 2: Complex Nonlinear, No Groups — BNN Excels
```r
# Nonlinear signal, group labels are arbitrary
d2 <- tibble(g = rep(1:J, length.out = N), x = runif(N, -3, 3)) %>%
  mutate(mu = 2 * sin(1.5 * x) + 0.5 * cos(3 * x),
         y = rnorm(n(), mu, 0.4))
```
No group-level heterogeneity — the group labels are noise. Both the BHM and the HBNN waste capacity estimating group intercepts that don't exist, but the BNN focuses entirely on learning the nonlinear signal. The HBNN is close because partial pooling drives the unnecessary group intercepts toward zero, but the BNN's simpler parameterization gives it an edge.
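The "pooling toward zero" claim can be made concrete with the classic known-variance shrinkage factor: the pooled estimate of each group offset is the raw group deviation scaled by (n_j/σ²) / (n_j/σ² + 1/σ_α²), which collapses to zero as the learned σ_α shrinks. A schematic Python sketch (not the actual Stan posterior; σ_α is set by hand here for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
J, n_per, sigma = 10, 40, 0.4

# group labels are noise: the true group effects are exactly zero
y = rng.normal(0.0, sigma, size=(J, n_per))
ybar = y.mean(axis=1)                      # raw per-group means

# known-variance partial-pooling shrinkage toward the population mean
sigma_alpha = 0.1                          # a small learned group spread
shrink = (n_per / sigma**2) / (n_per / sigma**2 + 1 / sigma_alpha**2)
alpha_hat = shrink * (ybar - ybar.mean())  # pooled estimates of group offsets

print(np.abs(alpha_hat).max() < np.abs(ybar - ybar.mean()).max())  # True
```

Because 0 < shrink < 1, every spurious group offset is pulled toward zero, and the pull strengthens as σ_α shrinks, which is exactly what happens when the data contain no real group heterogeneity.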
DGP 3: Grouped + Complex Nonlinear — HBNN Excels
```r
# Group offsets PLUS nonlinear signal
alpha_true3 <- rnorm(J, 0, 1.5)
d3 <- tibble(g = rep(1:J, each = n_per), x = runif(N, -3, 3)) %>%
  mutate(mu = alpha_true3[g] + 2 * sin(1.5 * x) + 0.5 * cos(3 * x),
         y = rnorm(n(), mu, 0.4))
```
Now the data has both group-specific offsets and a complex nonlinear signal. This is the scenario the HBNN was built for:
- The BHM handles the group offsets but can't capture the nonlinear signal
- The BNN learns the nonlinear signal but conflates it with group variation — effectively trying to learn 10 different shifted copies of the same curve without knowing they're shifted
- The HBNN separates concerns: the neural network learns the shared curve, and the hierarchical intercepts handle the per-group shifts
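The cost of ignoring the groups can be put in numbers: even if a group-blind model recovers the shared curve perfectly, the group offsets land in its residuals, inflating the effective noise from σ toward √(σ² + 1.5²). A back-of-envelope NumPy sketch using the DGP 3 settings above (best-case residuals for each model, not fitted posteriors):

```python
import numpy as np

rng = np.random.default_rng(3)
J, n_per, sigma = 10, 200, 0.4
alpha_true = rng.normal(0.0, 1.5, size=J)

g = np.repeat(np.arange(J), n_per)
x = rng.uniform(-3, 3, size=J * n_per)
f = 2 * np.sin(1.5 * x) + 0.5 * np.cos(3 * x)   # shared nonlinear signal
y = alpha_true[g] + f + rng.normal(0.0, sigma, size=J * n_per)

# best case for a group-blind model: it learns f exactly,
# but can only absorb the average offset, not per-group shifts
resid_blind = y - (f + alpha_true.mean())
# best case for the HBNN: hierarchical intercepts absorb the shifts
resid_hbnn = y - (f + alpha_true[g])

print(round(resid_blind.std(), 2), round(resid_hbnn.std(), 2))
```

The group-blind residual spread is dominated by the offset variance, while the HBNN's residual spread sits near the true noise level of 0.4.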
Summary: When Does Each Model Excel?
| Scenario | Best Model | Why |
|---|---|---|
| Simple structure, known form | BHM | Parsimony wins when the model matches the DGP |
| Complex signal, no structure | BNN | Flexible function class without wasted capacity |
| Complex signal + group structure | HBNN | Separates structural and functional components |
| Unknown or mixed regime | Ensemble | Insurance against misspecification |
```r
# CRPS results (lower = better out-of-sample prediction):
#
# DGP                              BHM     BNN     HBNN    Ensemble
# Grouped linear                   0.31★   1.07    0.84    0.49
# Complex nonlinear (no groups)    0.93    0.22★   0.22★   0.41
# Grouped + complex nonlinear      0.87    0.84    0.26★   0.73
#
# ★ = lowest CRPS (best out-of-sample predictions)
```
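These scores were computed with the scoringRules package in R, but the sample-based estimator is simple enough to sketch in a few lines of Python (CRPS ≈ E|X − y| − ½E|X − X′| over posterior predictive draws; the draws below are synthetic stand-ins, not output from the fitted models):

```python
import numpy as np

def crps_samples(draws, y):
    """Sample-based CRPS for one observation: E|X - y| - 0.5 * E|X - X'|."""
    draws = np.asarray(draws, dtype=float)
    term1 = np.mean(np.abs(draws - y))
    term2 = 0.5 * np.mean(np.abs(draws[:, None] - draws[None, :]))
    return term1 - term2

rng = np.random.default_rng(4)
draws = rng.normal(0.0, 1.0, size=1000)  # stand-in for posterior predictive draws

# a forecast centered on the truth scores lower (better) than a biased one
print(crps_samples(draws, 0.0) < crps_samples(draws, 2.0))  # True
```

Lower is better because CRPS rewards predictive distributions that are both sharp and well calibrated, which is why it is the scoring rule used throughout this comparison.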
The Takeaway
- BHM — use when the functional form is known and group structure dominates. Maximum efficiency, minimum waste.
- BNN — use when the signal is complex and there's no exploitable group structure. Let the network learn what you can't specify.
- HBNN — use when you have both: known groups and unknown functional forms. Encode what you know, learn what you don't.
- Ensemble — use when you're uncertain about the regime. Pay a small premium for robustness.
The HBNN isn't always the best model — it's the best model when the data has the structure it's designed for. That's the No Free Lunch theorem in action: every model has a niche, and the art is matching the model to the problem.
In domains like portfolio optimization, clinical trials, and marketing mix modeling, the data often has exactly this structure: known grouping variables (sectors, treatment arms, channels) with complex nonlinear effects. The HBNN architecture — encoding the structure you know and learning the rest — represents a principled middle ground between pure specification and pure flexibility.
Full Reprex
The complete R script is available at
hbnn_comparison_reprex.R.
It requires cmdstanr, scoringRules, and tidyverse. Runtime is
approximately 4–5 minutes.