Hierarchical Bayesian Neural Networks:
When Structure Meets Flexibility

Christopher Peters · Bayesian Methods · March 2026

In a previous post, we showed that Bayesian hierarchical models (BHMs) win when group structure is describable, Bayesian neural networks (BNNs) win when the signal is complex, and their ensemble hedges across both scenarios.

But what about the case where your data has both — known group structure and complex nonlinear signal? Neither the BHM nor the BNN is ideal:

  • The BHM captures group heterogeneity but can't represent the nonlinear signal
  • The BNN learns the nonlinear function but ignores group structure

Enter the hierarchical Bayesian neural network (HBNN) — a model that combines partial pooling over groups with a flexible neural network function class. This post adds the HBNN to our comparison and shows when each of the four approaches excels.

The Four Models

Model      Group structure                      Nonlinear flexibility
BHM        ✅ Group-varying intercept + slope    ❌ Linear in x
BNN        ❌ No group information               ✅ tanh hidden layer
HBNN       ✅ Group-varying intercepts           ✅ tanh hidden layer
Ensemble   Partial (via BHM component)          Partial (via BNN component)

The Hierarchical BNN in Stan

The HBNN adds partially-pooled group intercepts to the neural network. The non-centered parameterization lets Stan explore the posterior efficiently:

parameters {
  real alpha_0;                // population intercept
  real<lower=0> sigma_alpha;   // group intercept spread
  vector[J] alpha_raw;         // non-centered group deviations

  vector[H] w1;                // input→hidden weights
  vector[H] b1;                // hidden biases
  vector[H] w2;                // hidden→output weights
  real b2;                     // output bias
  real<lower=0> sigma;         // observation noise
}
transformed parameters {
  vector[J] alpha = alpha_0 + sigma_alpha * alpha_raw;
}
model {
  vector[N] mu;
  ...  // priors on alpha_raw, weights, biases, and sigma
  for (n in 1:N) {
    vector[H] h = tanh(w1 * x[n] + b1);
    mu[n] = alpha[g[n]] + b2 + dot_product(w2, h);
  }
  y ~ normal(mu, sigma);
}

Line by line, this is a BNN with one addition: alpha[g[n]] — a group-specific intercept drawn from a learned population distribution. The neural network learns the shared nonlinear signal f(x), and the hierarchical intercepts capture group-level shifts. Each component does what it's best at.
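The mean function is easy to sanity-check outside Stan. Here is a minimal R sketch of the same computation — one tanh hidden layer plus a group-specific intercept. The function name `hbnn_mu` and its argument layout are illustrative, not part of the Stan program:

```r
# R mirror of the Stan mean function (names assumed for illustration):
# mu[n] = alpha[g[n]] + b2 + sum(w2 * tanh(w1 * x[n] + b1))
hbnn_mu <- function(x, g, alpha, w1, b1, w2, b2) {
  sapply(seq_along(x), function(n) {
    h <- tanh(w1 * x[n] + b1)       # H hidden-unit activations
    alpha[g[n]] + b2 + sum(w2 * h)  # group intercept + network output
  })
}
```

Note that setting `w2` to zero collapses the HBNN back to a pure varying-intercept model, `alpha[g] + b2` — which is exactly why partial pooling can "turn off" the network component when the data don't need it.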

Three Data-Generating Processes

DGP 1: Grouped Linear — BHM Excels

# Group-specific intercepts and slopes, linear signal
alpha_true <- rnorm(J, 0, 1.5)
beta_true  <- rnorm(J, 2, 0.8)
d1 <- tibble(g = rep(1:J, each = n_per), x = runif(N, -2, 2)) %>%
  mutate(mu = alpha_true[g] + beta_true[g] * x,
         y  = rnorm(n(), mu, 0.5))

The signal is linear and the heterogeneity is in group-specific intercepts and slopes. The BHM is perfectly specified — its functional form matches the DGP exactly. The HBNN's neural network component adds unnecessary flexibility here, making it a close second but slightly less efficient due to the additional parameters.

Winner: BHM. When the functional form is simple enough to specify, the hierarchical model's parsimony is an advantage.

DGP 2: Complex Nonlinear, No Groups — BNN Excels

# Nonlinear signal, group labels are arbitrary
d2 <- tibble(g = rep(1:J, length.out = N), x = runif(N, -3, 3)) %>%
  mutate(mu = 2 * sin(1.5 * x) + 0.5 * cos(3 * x),
         y  = rnorm(n(), mu, 0.4))

No group-level heterogeneity — the group labels are noise. Both the BHM and the HBNN waste capacity estimating group intercepts that don't exist, but the BNN focuses entirely on learning the nonlinear signal. The HBNN is close because partial pooling drives the unnecessary group intercepts toward zero, but the BNN's simpler parameterization gives it an edge.

Winner: BNN. When there's no group structure, the simplest flexible model wins.

DGP 3: Grouped + Complex Nonlinear — HBNN Excels

# Group offsets PLUS nonlinear signal
alpha_true3 <- rnorm(J, 0, 1.5)
d3 <- tibble(g = rep(1:J, each = n_per), x = runif(N, -3, 3)) %>%
  mutate(mu = alpha_true3[g] + 2 * sin(1.5 * x) + 0.5 * cos(3 * x),
         y  = rnorm(n(), mu, 0.4))

Now the data has both group-specific offsets and a complex nonlinear signal. This is the scenario the HBNN was built for:

  • The BHM handles the group offsets but can't capture the nonlinear signal
  • The BNN learns the nonlinear signal but conflates it with group variation — effectively trying to learn 10 different shifted copies of the same curve without knowing they're shifted
  • The HBNN separates concerns: the neural network learns the shared curve, and the hierarchical intercepts handle the per-group shifts

Winner: HBNN. When the data has both describable structure and complex signal, the hierarchical neural network combines both advantages and achieves the lowest prediction error.

Summary: When Does Each Model Excel?

Scenario                          Best model  Why
Simple structure, known form      BHM         Parsimony wins when the model matches the DGP
Complex signal, no structure      BNN         Flexible function class without wasted capacity
Complex signal + group structure  HBNN        Separates structural and functional components
Unknown or mixed regime           Ensemble    Insurance against misspecification


# CRPS results (lower = better out-of-sample prediction):
#
# DGP                              BHM    BNN    HBNN   Ensemble
# Grouped linear                   0.31★  1.07   0.84   0.49
# Complex nonlinear (no groups)    0.93   0.22★  0.22★  0.41
# Grouped + complex nonlinear      0.87   0.84   0.26★  0.73
#
# ★ = lowest CRPS (best out-of-sample predictions)
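For reference, the CRPS of a single observation can be estimated directly from posterior predictive draws. The reprex uses scoringRules for this; the hand-rolled estimator below is only a sketch of the underlying empirical formula, CRPS ≈ E|X − y| − ½·E|X − X′|:

```r
# Empirical CRPS for one observation y given a vector of posterior
# predictive draws x. Illustrative only — scoringRules::crps_sample
# is the production implementation used in the reprex.
crps_emp <- function(y, x) {
  mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
}
```

A degenerate predictive distribution concentrated exactly on y scores 0, and the score grows as the draws move away from y or spread out — which is why lower CRPS means better calibrated, sharper out-of-sample predictions.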

The Takeaway

The practitioner's job is to match model architecture to data structure:
  • BHM — use when the functional form is known and group structure dominates. Maximum efficiency, minimum waste.
  • BNN — use when the signal is complex and there's no exploitable group structure. Let the network learn what you can't specify.
  • HBNN — use when you have both: known groups and unknown functional forms. Encode what you know, learn what you don't.
  • Ensemble — use when you're uncertain about the regime. Pay a small premium for robustness.
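The ensemble's mechanics are simple if each component model yields a matrix of posterior predictive draws (rows = observations, columns = draws). An equal-weight pooling sketch — `pool_draws` is a hypothetical helper, not part of the reprex:

```r
# Hypothetical equal-weight ensemble: concatenate posterior predictive
# draws from the two component models into one predictive sample
# per observation.
pool_draws <- function(draws_bhm, draws_bnn) {
  stopifnot(nrow(draws_bhm) == nrow(draws_bnn))
  cbind(draws_bhm, draws_bnn)  # rows = observations, cols = pooled draws
}
```

Pooling draws rather than averaging point predictions keeps the mixture's full predictive uncertainty, which is what CRPS rewards.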

The HBNN isn't always the best model — it's the best model when the data has the structure it's designed for. That's the No Free Lunch theorem in action: every model has a niche, and the art is matching the model to the problem.

In domains like portfolio optimization, clinical trials, and marketing mix modeling, the data often has exactly this structure: known grouping variables (sectors, treatment arms, channels) with complex nonlinear effects. The HBNN architecture — encoding the structure you know and learning the rest — represents a principled middle ground between pure specification and pure flexibility.

Full Reprex

The complete R script is available at hbnn_comparison_reprex.R. It requires cmdstanr, scoringRules, and tidyverse. Runtime is approximately 4–5 minutes.
