The No Free Lunch theorem reminds us that no single model dominates across all possible data-generating processes. In practice, the question isn't "which model is best?" but "which model is best-matched to the structure in my data?"
This post demonstrates three complementary results with a fully reproducible R reprex:
- Case 1: When the data has describable group structure, a Bayesian hierarchical model (BHM) produces lower out-of-sample prediction error.
- Case 2: When the signal is complex and unstructured, a Bayesian neural network (BNN) wins.
- Case 3: Their ensemble is competitive across both scenarios — hedging against model misspecification.
The key insight: for problems where structure can be described, Bayesian hierarchical models will tend to produce lower out-of-sample prediction error, because that structure constitutes significant prior information. When you can't describe the structure, flexible learners like BNNs earn their keep — and ensembling the two provides a robust hedge.
We evaluate models using the Continuous Ranked Probability Score (CRPS), a strictly proper scoring rule that assesses both calibration and sharpness of the full predictive distribution. Lower CRPS is better.
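For a predictive CDF F and a realized observation y, the CRPS is defined as:

```latex
\mathrm{CRPS}(F, y) \;=\; \int_{-\infty}^{\infty} \bigl( F(z) - \mathbf{1}\{ y \le z \} \bigr)^{2} \, dz
```

For sample-based forecasts such as MCMC draws, this integral is estimated from the empirical CDF of the draws, which is what scoringRules::crps_sample() computes by default.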
Setup
library(cmdstanr)
library(posterior)
library(scoringRules)
library(dplyr)
library(tibble)
library(ggplot2)
set.seed(42)
We compile two Stan models — a hierarchical linear model (group-varying intercepts and slopes) and a small Bayesian neural network (one hidden layer with tanh activations, no group structure).
The Two Models
Bayesian Hierarchical Model
Group-varying intercept + slope, partial pooling through a shared hyperprior. This is the standard mixed-effects specification:
parameters {
  real alpha_0;                 // population intercept
  real beta_0;                  // population slope
  real<lower=0> sigma_alpha;    // sd of group intercepts
  real<lower=0> sigma_beta;     // sd of group slopes
  vector[J] alpha_raw;          // group deviations (non-centered)
  vector[J] beta_raw;
  real<lower=0> sigma;          // residual sd
}
transformed parameters {
  vector[J] alpha = alpha_0 + sigma_alpha * alpha_raw;
  vector[J] beta = beta_0 + sigma_beta * beta_raw;
}
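The raw/transformed split is the non-centered parameterization: it is algebraically identical to drawing the group effects directly from the population distribution, but gives HMC an easier geometry when groups are only weakly informed:

```latex
\alpha^{\mathrm{raw}}_{j} \sim \mathcal{N}(0, 1), \qquad
\alpha_{j} = \alpha_{0} + \sigma_{\alpha}\, \alpha^{\mathrm{raw}}_{j}
\;\Longleftrightarrow\;
\alpha_{j} \sim \mathcal{N}\bigl(\alpha_{0},\, \sigma_{\alpha}^{2}\bigr),
```

and analogously for the slopes beta_j.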
Bayesian Neural Network
One hidden layer with H = 8 units, tanh activation, no group structure at all. Everything is learned from the signal in x:
parameters {
  vector[H] w1;           // input→hidden weights
  vector[H] b1;           // hidden biases
  vector[H] w2;           // hidden→output weights
  real b2;                // output bias
  real<lower=0> sigma;    // residual sd
}
model {
  vector[N] mu;           // per-observation mean
  ...
  for (n in 1:N) {
    vector[H] h = tanh(w1 * x[n] + b1);
    mu[n] = b2 + dot_product(w2, h);
  }
  y ~ normal(mu, sigma);
}
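Written out, the network's mean function is a linear combination of H tanh basis functions of x:

```latex
\mu(x) \;=\; b_{2} + \sum_{h=1}^{H} w_{2,h}\, \tanh\bigl( w_{1,h}\, x + b_{1,h} \bigr).
```

By the universal approximation property, a single hidden layer of this form can approximate any continuous function on a bounded interval arbitrarily well, given enough hidden units.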
Case 1: Grouped Linear DGP — BHM Wins
We simulate data from exactly the generative model that the BHM encodes: J = 10 groups, each with its own intercept and slope, drawn from a population distribution:
J <- 10; n_per <- 30; N1 <- J * n_per
alpha_true <- rnorm(J, 0, 1.5)
beta_true <- rnorm(J, 2, 0.8)
d1 <- tibble(
  g = rep(1:J, each = n_per),
  x = runif(N1, -2, 2)
) %>%
  mutate(
    mu = alpha_true[g] + beta_true[g] * x,
    y = rnorm(n(), mu, 0.5)
  )
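In notation, the Case 1 generative process is:

```latex
\alpha_{j} \sim \mathcal{N}(0,\, 1.5^{2}), \quad
\beta_{j} \sim \mathcal{N}(2,\, 0.8^{2}), \quad
y_{ij} \sim \mathcal{N}\bigl( \alpha_{j} + \beta_{j}\, x_{ij},\; 0.5^{2} \bigr),
\quad j = 1, \dots, 10.
```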
The BHM is well-specified: it matches the group structure and the linear functional form. The BNN has no access to group labels — it must learn everything from x alone. With 10 groups worth of systematic heterogeneity in both intercept and slope, the BNN is structurally handicapped.
Case 2: Complex Nonlinear DGP — BNN Wins
Now we flip the script. The true signal is a sum of sines — no group structure at all:
N2 <- 300  # sample size (assumed equal to Case 1's N1; not defined in the excerpt)
d2 <- tibble(
  g = rep(1:J, length.out = N2),
  x = runif(N2, -3, 3)
) %>%
  mutate(
    mu = 2 * sin(1.5 * x) + 0.5 * cos(3 * x),
    y = rnorm(n(), mu, 0.4)
  )
The group labels are arbitrary — there's no group-level heterogeneity in the DGP. The signal is entirely nonlinear in x. The BHM fits group-varying linear functions, which fundamentally cannot represent sin and cos. The BNN's tanh hidden layer can approximate this smooth nonlinearity.
Case 3: Ensemble — Robust Across Both DGPs
The ensemble mixes posterior predictive draws 50/50 from each model:
# Pool posterior predictive draws: half from BHM, half from BNN
n_draws <- ncol(yrep_hier)
ens_idx <- sample(c(TRUE, FALSE), n_draws, replace = TRUE)
yrep_ens <- yrep_hier
yrep_ens[, !ens_idx] <- yrep_bnn[, !ens_idx]
This is equal-weight model averaging: a fixed 50/50 mixture, rather than Bayesian model averaging proper, which would weight the models by marginal likelihood or stacking weights. The ensemble's predictive distribution is a mixture of the two models' posterior predictive distributions.
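To see that the column swap really yields a 50/50 mixture, here is a minimal self-contained sketch with synthetic draw matrices (the yrep_* names mirror the reprex; the values here are toys, not model output):

```r
set.seed(42)
n_test <- 5; n_draws <- 1000
yrep_hier <- matrix(rnorm(n_test * n_draws, mean = 0), n_test, n_draws)  # toy "BHM" draws
yrep_bnn  <- matrix(rnorm(n_test * n_draws, mean = 3), n_test, n_draws)  # toy "BNN" draws
ens_idx <- sample(c(TRUE, FALSE), n_draws, replace = TRUE)  # one coin flip per draw
yrep_ens <- yrep_hier
yrep_ens[, !ens_idx] <- yrep_bnn[, !ens_idx]
rowMeans(yrep_ens)  # each row mean sits near the mixture mean (0 + 3) / 2 = 1.5
```

Because each posterior draw is kept from one model with probability one half, every test point's predictive distribution becomes the equal-weight mixture of the two models' predictives.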
Summary: CRPS Comparison
Lower CRPS indicates better out-of-sample predictive performance.
# CRPS results (lower = better out-of-sample prediction):
#
# DGP BHM BNN Ensemble
# Grouped linear 0.31★ 1.07 0.49
# Complex nonlinear 0.89 0.21★ 0.40
#
# ★ = lowest CRPS (best out-of-sample predictions)
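Each table entry is an average of per-observation CRPS values over the held-out test set. With a matrix of posterior predictive draws in hand (rows = test points, columns = draws), scoringRules makes this a one-liner; the inputs below are toys standing in for the fitted models' output:

```r
library(scoringRules)
set.seed(42)
n_test <- 50; n_draws <- 2000
y_test <- rnorm(n_test)                                     # toy held-out observations
yrep   <- matrix(rnorm(n_test * n_draws), n_test, n_draws)  # toy predictive draw matrix
mean(crps_sample(y_test, yrep))  # average CRPS across the test set
```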
The Takeaway
- When you can describe the structure — groups, hierarchies, known functional forms — encode it in a Bayesian hierarchical model. That structure is information, and information reduces prediction error.
- When the structure is unknown or too complex to specify — use a flexible model like a BNN. Better to learn the function than to misspecify it.
- When you're uncertain which regime you're in — ensemble. You pay a small premium over the best specialist, but you're insured against the worst case.
This is particularly relevant in portfolio optimization and financial modeling, where some aspects of the data (sector effects, seasonal patterns, cross-asset correlations) have known structure worth encoding, while other aspects (regime changes, tail events, nonlinear interactions) resist explicit specification. The practitioner's job is to match model complexity to problem structure — and to ensemble where structure is ambiguous.
Full Reprex
The complete, self-contained R script is available at
bhm_bnn_ensemble_reprex.R.
It requires cmdstanr, scoringRules, and standard tidyverse packages. Runtime is approximately 2–3 minutes on a modern laptop.